We study the problem of nonparametric estimation of a multivariate function
$g:\mathbb{R}^d \to \mathbb{R}$ that can be represented as a composition of two unknown
smooth functions $f:\mathbb{R} \to \mathbb{R}$ and $G:\mathbb{R}^d \to \mathbb{R}$. We suppose
that $f$ and $G$ belong to known smoothness classes of functions, with smoothness
$\gamma$ and $\beta$, respectively. We obtain the full description of minimax rates of
estimation of $g$ in terms of $\gamma$ and $\beta$, and propose rate-optimal estimators
for the sup-norm loss. For the construction of such estimators, we first prove an
approximation result for composite functions that may have an independent interest,
and then a result on adaptation to the local structure. Interestingly, the
construction of rate-optimal estimators for composite functions (with given, fixed
smoothness) needs adaptation, but not in the traditional sense: it is now adaptation
to the local structure. We prove that composition models generate only two types of
local structures: the local single-index model and the local model with roughness
isolated to a single dimension (i.e., a model containing elements of both additive
and single-index structure). We also find the zones of $(\gamma,\beta)$ where no local
structure is generated, as well as the zones where the composition modeling leads to
faster rates, as compared to the classical nonparametric rates that depend only on
the overall smoothness of $g$.

1. Introduction. In this paper we study the problem of nonparametric
estimation of an unknown function $g:\mathbb{R}^d \to \mathbb{R}$ in the multidimensional
Gaussian white noise model described by the stochastic differential equation

(1)  $X_\varepsilon(dt) = g(t)\,dt + \varepsilon W(dt), \qquad t = (t_1,\dots,t_d) \in \mathcal{D},$

where $\mathcal{D}$ is a bounded open interval in $\mathbb{R}^d$ containing $[-1,1]^d$, $W$ is
the standard Brownian sheet in $\mathbb{R}^d$ and $0 < \varepsilon < 1$ is a known noise level.
Our goal is to estimate the function $g$ on the set $[-1,1]^d$ from the observation
$\{X_\varepsilon(t), t \in \mathcal{D}\}$. For $d = 2$ this corresponds to the problem of image
reconstruction from observations corrupted by additive noise. We consider an
observation set $\mathcal{D}$ that is larger than $[-1,1]^d$ in order to avoid the discussion
of boundary effects.
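
For intuition, the observation model (1) can be simulated on a grid. The following is a minimal sketch for $d = 2$ and is not part of the construction in this paper: the helper name `simulate_white_noise`, the grid size `n` and the half-width `a` of the observation set are illustrative assumptions. Each grid cell of side `h` receives an approximation of the increment of $X_\varepsilon$ over that cell, namely $g$ at the cell midpoint times the cell volume, plus $\varepsilon$ times a centered Gaussian variable whose variance equals the cell volume.

```python
import numpy as np

def simulate_white_noise(g, eps, n=64, a=1.5, seed=0):
    """Cell increments of X_eps on an n-by-n grid covering [-a, a]^2 (hypothetical helper)."""
    rng = np.random.default_rng(seed)
    h = 2.0 * a / n                                  # side length of one grid cell
    centers = -a + h * (np.arange(n) + 0.5)          # cell midpoints along each axis
    t1, t2 = np.meshgrid(centers, centers, indexing="ij")
    drift = g(t1, t2) * h**2                         # midpoint rule for the integral of g over a cell
    noise = eps * h * rng.standard_normal((n, n))    # eps * W(cell), where Var W(cell) = h^2
    return centers, drift + noise

# Example with a composite function g = f(G(t)), f = sin and G(t) = t1 + t2^2 (toy choice).
centers, dX = simulate_white_noise(lambda t1, t2: np.sin(t1 + t2**2), eps=0.05)
```

Dividing `dX` by `h**2` gives noisy local averages of $g$, which is the form of data a kernel-type linear estimator would act on in this discretized setting.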

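To measure the performance of estimators, we use the risk function determined
by the sup-norm $\|\cdot\|_\infty$ on $[-1,1]^d$: for $g:\mathbb{R}^d \to \mathbb{R}$,
$0 < \varepsilon < 1$, $p > 0$, and for an arbitrary estimator $\tilde g_\varepsilon$ based on the
observation $\{X_\varepsilon(t), t \in \mathcal{D}\}$ we consider the risk

(2)  $R_\varepsilon(\tilde g_\varepsilon, g) = E_g\|\tilde g_\varepsilon - g\|_\infty^p.$

Here and in what follows $E_g$ denotes the expectation with respect to the
distribution of the observation $\{X_\varepsilon(t), t \in \mathcal{D}\}$ satisfying (1).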

We suppose that $g \in \mathcal{G}_s$, where $\{\mathcal{G}_s, s \in S\}$ is a collection of
functional classes indexed by $s \in S$. The functional classes $\mathcal{G}_s$ that we will
consider consist of smooth composite functions; we discuss this choice in detail below.

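For a given class $\mathcal{G}_s$ we define the maximal risk

(3)  $R_\varepsilon(\tilde g_\varepsilon, \mathcal{G}_s) = \sup_{g \in \mathcal{G}_s} R_\varepsilon(\tilde g_\varepsilon, g).$

Our first aim is to study the asymptotics, as the noise level $\varepsilon$ tends to 0, of
the minimax risk

     $\inf_{\tilde g_\varepsilon} R_\varepsilon(\tilde g_\varepsilon, \mathcal{G}_s),$

where $\inf_{\tilde g_\varepsilon}$ denotes the infimum over all estimators of $g$. We suppose
that the parameter $s$ is known, and therefore the functional class $\mathcal{G}_s$ is fixed.
We find the minimax rate of convergence $\phi_\varepsilon(s)$ on $\mathcal{G}_s$, that is, the rate
that satisfies $\phi_\varepsilon^p(s) \asymp \inf_{\tilde g_\varepsilon} R_\varepsilon(\tilde g_\varepsilon, \mathcal{G}_s)$,
and we construct an estimator attaining this rate, which we refer to as a
rate-optimal estimator in the asymptotic minimax sense.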

2. Global rate-optimal estimation via pointwise selection. In this section
we discuss a rather general method of data-driven selection from a given
family of estimators. This method, called a pointwise selection rule,¹ is at
the core of the paper. We will use it to construct our rate-optimal estimators.

¹This selection rule was the topic of the IMS Medallion Lecture given by the second
author at the Joint Statistical Meetings in Minneapolis, 2005.
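To present the pointwise selection rule we need some definitions. Let $\mathcal{D}_1$ be
an open interval such that $[-1,1]^d \subset \mathcal{D}_1 \subset \mathcal{D}$. Any function
$K:\mathbb{R}^d \times \mathbb{R}^d \to \mathbb{R}$ such that

     $\int K(t,x)\,dt = 1 \qquad \forall x \in [-1,1]^d,$

     $\operatorname{supp} K(\cdot,x) \subset \mathcal{D} \qquad \forall x \in \mathcal{D}_1,$

will be called a weight. Let $\mathcal{K}$ be a given family of weights and let
$x \in [-1,1]^d$ be fixed. To any $K \in \mathcal{K}$ we associate a linear estimator at $x$:

     $\hat g_K(x) = \int_{\mathcal{D}} K(t,x)\,X_\varepsilon(dt).$

We consider a family of linear estimators $\mathcal{G}(\mathcal{K}) = \{\hat g_K(x), K \in \mathcal{K}\}$.
Note that $\hat g_K(x)$ is a normal random variable with variance
$\varepsilon^2\|K(\cdot,x)\|_2^2$, where $\|\cdot\|_2$ denotes the $L_2$ norm. Define
$\sigma_K = \sup_{x \in \mathcal{D}}\|K(\cdot,x)\|_2$ and assume that the family $\mathcal{K}$ satisfies

     $\sup_{K \in \mathcal{K}} \sigma_K < \infty.$

For any pair of weights $K_1$ and $K_2$ define the function

     $[K_1 \otimes K_2](\cdot,\cdot) = \int K_1(\cdot,y)\,K_2(y,\cdot)\,dy.$

We say that $\mathcal{K}$ is a commutative weight system if

     $[K_1 \otimes K_2] = [K_2 \otimes K_1] \qquad \forall K_1, K_2 \in \mathcal{K}.$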

We now present the pointwise selection rule and briefly discuss some
examples where it can be applied. The rule consists of the following two steps:

1. Determination of acceptable weights. Let $\mathcal{K}$ be a commutative weight
   system and let $\mathrm{th}_\varepsilon(\mathcal{K})$ be a threshold whose choice will be discussed
   below. We say that a weight $K \in \mathcal{K}$ [resp., the estimator $\hat g_K(x)$] is
   acceptable if

     $|\hat g_{K \otimes K'}(x) - \hat g_{K'}(x)| \le M(\mathcal{K})\,\mathrm{th}_\varepsilon(\mathcal{K})\,\sigma_{K'} \qquad \forall K' \in \mathcal{K}: \sigma_{K'} \ge \sigma_K,$

   where $M(\mathcal{K}) = \sup_{K \in \mathcal{K}} \sup_{x \in \mathcal{D}} \|K(\cdot,x)\|_1$ and $\|\cdot\|_1$
   denotes the $L_1$ norm.

2. Selection from the set of acceptable estimators. Let $\hat{\mathcal{K}}$ be the set of all
   the acceptable weights in $\mathcal{K}$. Note that $\hat{\mathcal{K}}$ is a random set and it can be
   empty with some probability. If $\hat{\mathcal{K}} \ne \emptyset$ we select the estimator
   $\hat g_{\hat K}(x)$ with $\hat K$ such that $\sigma_{\hat K} = \min_{K \in \hat{\mathcal{K}}} \sigma_K$,
   that is, we choose an acceptable estimator with minimal variance. If
   $\hat{\mathcal{K}} = \emptyset$ we select an arbitrary fixed estimator $\hat g_{K_0}(x)$, where
   $K_0$ is a given weight from $\mathcal{K}$.
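
The two steps above can be sketched in code for a small finite family of convolution kernels in dimension one. This is only an illustration under stated assumptions, not the procedure analyzed in the paper: the data `y` are taken to be local averages of the observation on a regular grid with step `delta`, each weight is stored as a discrete array summing to 1, the names `boxcar`, `estimate_at` and `select_weight` are hypothetical, and the threshold is passed in as a plain number. For convolution kernels the operation $K \otimes K'$ is the ordinary convolution, which is what `np.convolve` computes on the discrete weights.

```python
import numpy as np

def boxcar(width):
    """Discrete boxcar weights of odd length `width` that sum to 1."""
    return np.full(width, 1.0 / width)

def estimate_at(w, y, j):
    """Linear estimate at grid index j: weights w centred at j applied to local averages y."""
    r = len(w) // 2
    return float(np.dot(w, y[j - r : j + r + 1]))

def select_weight(weights, y, j, delta, threshold):
    """Index of the weight selected at grid point j (fallback: index 0, playing the role of K_0)."""
    sigma = [np.sqrt(np.sum(w**2) / delta) for w in weights]   # L2 norms of the kernels
    m_const = max(np.sum(np.abs(w)) for w in weights)          # largest L1 norm in the family
    acceptable = []
    for k, w in enumerate(weights):
        ok = True
        for kp, wp in enumerate(weights):
            if sigma[kp] < sigma[k]:
                continue                     # compare only with weights K' such that sigma_K' >= sigma_K
            composed = np.convolve(w, wp)    # K (x) K' reduces to convolution for these kernels
            gap = abs(estimate_at(composed, y, j) - estimate_at(wp, y, j))
            if gap > m_const * threshold * sigma[kp]:
                ok = False
                break
        if ok:
            acceptable.append(k)
    if not acceptable:
        return 0
    return min(acceptable, key=lambda k: sigma[k])   # acceptable weight with minimal variance
```

On noiseless constant data every weight is acceptable and the widest boxcar (smallest $L_2$ norm) is chosen; at a jump of the signal only the narrowest boxcar passes the comparison, so the rule falls back to it.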
There is no general recipe for the choice of the threshold $\mathrm{th}_\varepsilon(\mathcal{K})$. It
may depend on the weight system, on the nature of the considered problem (pointwise
or global estimation), on the loss functional, etc. However, if we consider
the risk (2) and if the weight system $\mathcal{K}$ is not too large (e.g., $\mathcal{K}$ is a metric
compact with a polynomial behavior of covering numbers) it can be shown
that there is a universal choice of the threshold:
$\mathrm{th}_\varepsilon(\mathcal{K}) = C\varepsilon\sqrt{\ln(1/\varepsilon)}$, where $C > 0$ is a constant depending
only on the power $p$ of the loss function and on the dimension $d$. Such a choice of
the threshold will be used in this paper.

A remarkable property of the pointwise selection rule is that it can be
shown to work for any commutative weight system. As we will see in the
following examples, the commutativity property is inherent to a variety of
weight systems used in statistics.

Examples of commutative weight systems. We now consider some examples
of commutative weight systems. Let $\mathcal{Q}$ be any set of functions
$Q:\mathbb{R}^d \to \mathbb{R}$ such that $\operatorname{supp}(Q) \subset [-\delta,\delta]^d$, $\delta > 0$,
and $\int Q = 1$. Take $\mathcal{D} = [-a,a]^d$ and $\mathcal{D}_1 = [-b,b]^d$, where $a > b > 1$,
$a - b > \delta$ are given numbers. Define

     $\mathcal{K} = \{K:\mathbb{R}^d \times \mathbb{R}^d \to \mathbb{R}: K(t,x) = Q(t - x),\ Q \in \mathcal{Q}\}.$

Then $\mathcal{K}$ is a commutative weight system. Indeed, the integration over $\mathcal{D}_1$
in the definition of the weight and in the definition of $[K_1 \otimes K_2]$ can be
replaced by integration over $\mathbb{R}^d$, and the operation $\otimes$ reduces to the standard
convolution:

     $[K_1 \otimes K_2] = K_1 * K_2 = K_2 * K_1 = [K_2 \otimes K_1].$

This allows us to construct various commutative weight systems. We now
consider some of them.

The selection of an estimator from a given family first appeared in the
context of adaptive estimation. In particular, in [16] a pointwise selection
rule was proposed in order to construct pointwise adaptive estimators over
a scale of Hölder classes. This method was generalized in [21] to a pointwise
selection rule from the collection $\mathcal{G}(\mathcal{K}_{\mathcal{H}_1})$ with the family of weights

     $\mathcal{K}_{\mathcal{H}_1} = \{h^{-1} Q_0((t - x)/h),\ h \in \mathcal{H}_1\},$

where $d = 1$, $Q_0 \in \mathcal{Q}$ is a given function, $\mathcal{H}_1 = [h_{\min}, h_{\max}]$ and the
numbers $0 < h_{\min} < h_{\max} \le 1$ are chosen by the statistician. In words, the family
$\mathcal{G}(\mathcal{K}_{\mathcal{H}_1})$ consists of kernel estimators with bandwidth varying from
$h_{\min}$ to $h_{\max}$. The estimator chosen from the collection $\mathcal{G}(\mathcal{K}_{\mathcal{H}_1})$ in
accordance with the pointwise selection rule of [21] is rate optimal over the Besov
classes of functions; compare [19].
More recently, pointwise adaptive methods have been developed in dimensions
larger than 1. Thus, [14, 15] propose a pointwise selection rule from
the collection $\mathcal{G}(\mathcal{K}_{\mathcal{H}_d})$ where

     $\mathcal{K}_{\mathcal{H}_d} = \Big\{\prod_{i=1}^d h_i^{-1} Q_0\Big(\frac{t_i - x_i}{h_i}\Big),\ (h_1,\dots,h_d) \in \mathcal{H}_d\Big\}.$

Here the $x_i$ are the components of $x$, and
$\mathcal{H}_d = \prod_{i=1}^d [h_i^{\min}, h_i^{\max}]$ with the values
$0 < h_i^{\min} < h_i^{\max} < \infty$, $i = 1,\dots,d$, that are chosen by the statistician.
The pointwise selection rule of [14] leads to an estimator that is pointwise
adaptive over the scale of anisotropic Besov classes [14, 15].

The results of these papers show that pointwise selection is a useful tool for 
estimation of functions with inhomogeneous smoothness. Another approach 
to multivariate function estimation is based on structural models. Typical 
examples are the single index model and the additive model (see Section 3 
for more details). For such models, an important issue is adaptation to the 
unknown structure, and it can also be carried out via the pointwise selection
rule [8]. The weight system used in pointwise selection for the single-index 
model [8] will also appear in some parts of the present paper. It makes use of 
the ridge functions. Another system of ridge functions is proposed in [4, 5] 
for the problem of recovery of functions of two variables with discontinuities 
along smooth edges and smooth otherwise. Note that the approach of [4, 
5] is conceptually different, and does not rely on pointwise selection rules. 
Examples of more complex commutative weight systems can be found in [8, 
20]. Another construction leading to quite an unusual commutative weight 
system will be given in Section 6.2. 

In the present paper we specify the pointwise selection rule for the problem
of estimation of composite functions. Our structural assumption is that the
function $g:\mathbb{R}^d \to \mathbb{R}$ can be represented as a composition of two unknown
smooth functions $f:\mathbb{R} \to \mathbb{R}$ and $G:\mathbb{R}^d \to \mathbb{R}$, that is, $g = f \circ G$.

3. Why smooth composite functions. We now discuss why this structural 
assumption is relevant. We start with the following definition. 

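Definition 1. Fix $\alpha > 0$ and $L > 0$. Let $\lfloor\alpha\rfloor$ be the largest integer
which is strictly less than $\alpha$, and for $k = (k_1,\dots,k_d) \in \mathbb{N}^d$ set
$|k| = k_1 + \cdots + k_d$. The isotropic Hölder class $\mathbb{H}_d(\alpha,L)$ is the set of all
functions $G:\mathbb{R}^d \to \mathbb{R}$ having on $\mathbb{R}^d$ all partial derivatives of order
$\lfloor\alpha\rfloor$ and such that

(4)  $\Big| G(y) - \sum_{0 \le |k| \le \lfloor\alpha\rfloor} \frac{\partial^{|k|} G(x)}{\partial x_1^{k_1}\cdots\partial x_d^{k_d}} \prod_{j=1}^d \frac{(y_j - x_j)^{k_j}}{k_j!} \Big| \le L\|y - x\|^{\alpha} \qquad \forall x, y \in \mathbb{R}^d,$

where $x_j$ and $y_j$ are the $j$th components of $x$ and $y$ and $\|\cdot\|$ is the Euclidean
norm in $\mathbb{R}^d$.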

Parameter a characterizes the isotropic (i.e., the same in each direction) 
smoothness of function G. 

Let now $f$ and $G$ be smooth functions such that $f \in \mathbb{H}_1(\gamma, L_1)$ and
$G \in \mathbb{H}_d(\beta, L_2)$ where $\gamma, L_1, \beta, L_2$ are positive constants. Here and in
what follows $\mathbb{H}_1(\gamma, L_1)$ and $\mathbb{H}_d(\beta, L_2)$ are the Hölder class on $\mathbb{R}$
and the isotropic Hölder class on $\mathbb{R}^d$, respectively. The class of composite
functions $g = f(G(x))$ with such $f$ and $G$ will be denoted by $\mathbb{H}(A,\mathcal{L})$, where
$A = (\gamma,\beta) \in \mathbb{R}_+^2$ and $\mathcal{L} = (L_1, L_2) \in \mathbb{R}_+^2$.

The performance of an estimation procedure will be measured by the
sup-norm risk (3) where we set $s = (A,\mathcal{L})$ and $\mathcal{G}_s = \mathbb{H}(A,\mathcal{L})$.

3.1. Motivation I: models of reduced complexity. It is well known that the
main difficulty in estimation of multivariate functions is the curse of
dimensionality: the best attainable rate of convergence of the estimators
deteriorates very fast as the dimension grows. To illustrate this effect, suppose, for
example, that the underlying function $g$ belongs to $\mathcal{G}_s = \mathbb{H}_d(\alpha,L)$,
$s = (\alpha,L)$, $\alpha > 0$, $L > 0$. Then the rate of convergence for the risk (3),
uniformly on $\mathbb{H}_d(\alpha,L)$, cannot be asymptotically better than

     $\psi_{\varepsilon,d}(\alpha) = \big(\varepsilon\sqrt{\ln(1/\varepsilon)}\big)^{2\alpha/(2\alpha+d)}$

(cf. [6, 12, 13, 23, 25]). This is also the minimax rate on $\mathbb{H}_d(\alpha,L)$; it is
attained, for example, by a kernel estimator with properly chosen bandwidth
and kernel. More results on asymptotics of the minimax risks in estimation
of multivariate functions can be found in [2, 3, 14, 15, 22]. It is clear that
if $\alpha$ is fixed and $d$ is large enough this asymptotics is too pessimistic to be
used for real data.
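
To make the deterioration with the dimension concrete, here is a small numerical sketch. It assumes the form $\psi_{\varepsilon,d}(\alpha) = (\varepsilon\sqrt{\ln(1/\varepsilon)})^{2\alpha/(2\alpha+d)}$ of the classical rate, which is the form used later in the text for $\psi_{\varepsilon,d}(\gamma\beta)$; the function name `psi` and the chosen values of $\varepsilon$, $\alpha$ and $d$ are illustrative only.

```python
import math

def psi(eps, d, alpha):
    """Classical rate (eps * sqrt(ln(1/eps)))^(2*alpha / (2*alpha + d)) for 0 < eps < 1."""
    base = eps * math.sqrt(math.log(1.0 / eps))
    return base ** (2.0 * alpha / (2.0 * alpha + d))

# With the smoothness alpha fixed, the rate degrades as the dimension d grows.
rates = {d: psi(1e-3, d, alpha=2.0) for d in (1, 2, 5, 20)}
```

Since the base $\varepsilon\sqrt{\ln(1/\varepsilon)}$ is smaller than 1 here, a smaller exponent means a larger (worse) rate, so the entries of `rates` increase with `d`.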

At the origin of this phenomenon is the fact that the $d$-dimensional
isotropic Hölder class $\mathbb{H}_d(\alpha,L)$ is too massive in terms of its metric entropy.
A way to circumvent the curse of dimensionality is to consider models with
slimmer functional classes (i.e., classes with smaller metric entropy). There
are several ways to do it.

• A first way is to impose a restriction on the smoothness parameter of
  the functional class. For the class $\mathbb{H}_d(\alpha,L)$, a convenient restriction is to
  assume that the smoothness $\alpha$ increases with the dimension, and thus the
  class becomes smaller (its metric entropy decreases). For instance, we can
  suppose that $\alpha = \kappa d$ with some fixed $\kappa > 0$. Then the dimension disappears
  from the expression for $\psi_{\varepsilon,d}(\alpha)$, which means that we escape from the
  curse of dimensionality. However, the condition $\alpha = \kappa d$ or other similar
  restrictions that link smoothness and dimension are usually difficult to
  motivate. An interesting related example is given by the class of functions
  with bounded integrals of the multivariate Fourier transform [1].
• One can also impose a structural assumption on the function $g$ to be
  estimated. Two classical examples are provided by the single-index and
  additive structures (cf., e.g., [7, 9, 11, 26]).

  The single-index structure is defined by the following assumption on $g$:
  there exist a function $F_0:\mathbb{R} \to \mathbb{R}$ and a vector $\vartheta \in \mathbb{R}^d$ with
  $\|\vartheta\| = 1$ such that $g(x) = F_0(\vartheta^T x)$.

  The additive structure is defined by the following assumption: there
  exist functions $F_j:\mathbb{R} \to \mathbb{R}$, $j = 1,\dots,d$, such that
  $g(x) = F_1(x_1) + \cdots + F_d(x_d)$, where $x_j$ is the $j$th component of $x \in \mathbb{R}^d$.

  If we suppose that $F_j \in \mathbb{H}_1(\alpha, L)$, $j = 0,\dots,d$, then in both cases the
  function $g$ can be estimated with the rate
  $(\varepsilon\sqrt{\ln(1/\varepsilon)})^{2\alpha/(2\alpha+1)}$, which does not depend on the dimension
  and coincides with the minimax rate $\psi_{\varepsilon,1}(\alpha)$ of estimation of functions
  on $\mathbb{R}$.

In general, under structural assumptions the rate of convergence of
estimators improves, as compared to the slow $d$-dimensional rate $\psi_{\varepsilon,d}(\alpha)$. For
the above examples the rate does not depend on the dimension.

However, it is often quite restrictive to assume that g has some simple 
structure, such as the single-index or additive one, on the whole domain of 
its definition. In what follows we refer to this assumption as global structure. 

A more flexible way of modeling is to suppose that g has a local structure. 
For instance, we can assume that g is well approximated by some single-index 
or additive structure (or by a combination of both) in a small neighborhood
of a given point x. Local structure depends on x and remains unchanged 
within the neighborhood. Such an approach can be used to model much more 
complex objects than the global one. However, the form of the d-dimensional 
neighborhood and the local structure should be chosen by the statistician 
in advance, which makes the local approach rather subjective. 

In the present paper we try to find a compromise between the global 
and local modeling. Our idea is to consider a sufficiently general global 
model that would generate suitable local structures, and thus would allow 
us to construct estimators with nice statistical properties. We argue that this 
program can be realized for global models where the underlying function g 
is a composition of two smooth functions. 
3.2. Motivation II: structure-adaptive estimation. The problem of estimation
of a composite function can be viewed as that of structural adaptation.
Indeed, let us suppose that the function $G$ is known and $\beta > 1$. It is
easy to see that in this case the function $g$ can be estimated with the rate
$\psi_{\varepsilon,1}(\gamma)$ corresponding to that of estimation of the univariate function $f$ of
smoothness $\gamma$.

Thus, the function G can be considered as a functional nuisance param- 
eter characterizing the unknown structure of the function g. An important 
question in this context is: what is the price to pay for adaptation to the 
unknown G? 

Note that the composite model is a kind of generalization of the single-index
model; instead of the linear function in the latter model we have here
a general function $G$. As discussed above, for the single-index model the
optimal rate equals $\psi_{\varepsilon,1}(\gamma)$. We show that in the general situation
when $G$ is nonlinear, the optimal rate of convergence on $\mathbb{H}(A,\mathcal{L})$ [that we
denote $\psi_\varepsilon(A)$] is slower than $\psi_{\varepsilon,1}(\gamma)$, that is,
$\psi_{\varepsilon,1}(\gamma)/\psi_\varepsilon(A) \to 0$, $\varepsilon \to 0$.

It is easy to see that the class $\mathbb{H}(A,\mathcal{L})$ is contained in the Hölder class
$\mathbb{H}_d(\alpha_{\gamma,\beta}, L_3)$, where $L_3 = L_3(\mathcal{L})$ and

     $\alpha_{\gamma,\beta} = \gamma\beta$ if $0 < \gamma, \beta \le 1$, and $\alpha_{\gamma,\beta} = \gamma \wedge \beta$ otherwise.

This inclusion implies that if we ignore the composition structure, that is,
if we simply suppose that $g \in \mathbb{H}_d(\alpha_{\gamma,\beta}, L_3)$, then we can only guarantee the
rate of convergence $\psi_{\varepsilon,d}(\alpha_{\gamma,\beta})$. On the other hand, it follows from our
results given below that $\psi_\varepsilon(A)/\psi_{\varepsilon,d}(\alpha_{\gamma,\beta}) \to 0$, $\varepsilon \to 0$, for
various values of the regularity parameter $A$. In other words, the knowledge of the
fact that we have a composition structure allows us to improve the rate of
convergence as compared to the rate of the best estimator, which only relies on the
smoothness properties of $g$.

However, for certain values of the parameter $A = (\gamma,\beta)$ no improvement
due to the structure can be expected. This happens when the structural
assumption is essentially equivalent to the fact that $g$ belongs to some isotropic
Hölder class. This effect takes place for the following values of
$(\gamma,\beta) \in \mathbb{R}_+^2$:

1°. $0 < \gamma, \beta \le 1$ (zone of slow rate). Clearly, in this zone
$\mathbb{H}(A,\mathcal{L}) \subset \mathbb{H}_d(\gamma\beta, L_3)$, where $L_3$ is a positive constant depending
only on $\gamma$, $\beta$ and $\mathcal{L}$. Due to this inclusion a standard kernel estimator with
properly chosen bandwidth and the boxcar kernel converges with the rate
$\psi_{\varepsilon,d}(\gamma\beta) = (\varepsilon\sqrt{\ln(1/\varepsilon)})^{2\gamma\beta/(2\gamma\beta+d)}$. It is not
hard to see (cf. Theorem 1) that this rate is optimal, that is, that a lower bound on
the minimax risk holds with the same "slow" rate $\psi_{\varepsilon,d}(\gamma\beta)$ (note that
$\gamma\beta \le 1$).
2°. $\gamma \ge \beta$, $\gamma \ge 1$ (zone of inactive structure). In this zone we easily
get the inclusions $\mathbb{H}_d(\beta, L_4) \subset \mathbb{H}(A,\mathcal{L}) \subset \mathbb{H}_d(\beta, L_5)$,
where $L_4$ and $L_5$ are positive constants depending only on $\beta$ and $\mathcal{L}$. To show
the left inclusion it suffices to consider a set of composite functions with linear
$f$ and $G \in \mathbb{H}_d(\beta, L)$. Therefore, the asymptotics of the minimax risk on
$\mathbb{H}(A,\mathcal{L})$ is the same as for an isotropic Hölder class $\mathbb{H}_d(\beta,\cdot)$, that is,
the minimax rate on this class is $\psi_{\varepsilon,d}(\beta)$. Note that here we estimate as if
there were no structure, and the asymptotics of the minimax risk does not depend on
$\gamma$. This explains why we refer to this zone as that of inactive structure.

We finally remark that if $\beta \le 1$ the composite function $g$ is rather
nonsmooth. The effective smoothness equals $(1 \wedge \gamma)\beta$, and in view of the
above discussion, the minimax rate of convergence of estimators on $\mathbb{H}(A,\mathcal{L})$
is the same as on the Hölder class $\mathbb{H}_d((1 \wedge \gamma)\beta,\cdot)$. This is a very slow
rate, $\psi_{\varepsilon,d}((1 \wedge \gamma)\beta)$. Therefore, only for $\beta > 1$ can one expect to
find estimators with interesting statistical properties.

4. Main results. In this section we state the main results and outline 
the estimation method. The formal description of the estimation procedure 
and the proofs are deferred to Sections 5 and 7.1-7.2, respectively. 

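4.1. Lower bound for the risks of arbitrary estimators. For any
$A = (\gamma,\beta) \in \mathbb{R}_+^2$ define

(5)  $\phi_\varepsilon(\gamma,\beta) = (\varepsilon\sqrt{\ln(1/\varepsilon)})^{2\gamma/(2\gamma+1+(d-1)/\beta)}$   if $\beta > 1$, $\beta \ge d(\gamma-1)+1$,
     $\phi_\varepsilon(\gamma,\beta) = (\varepsilon\sqrt{\ln(1/\varepsilon)})^{2/(2+d/\beta)}$   if $\gamma > 1$, $\beta < d(\gamma-1)+1$,
     $\phi_\varepsilon(\gamma,\beta) = (\varepsilon\sqrt{\ln(1/\varepsilon)})^{2\gamma\beta/(2\gamma\beta+d)}$   if $(\gamma,\beta) \in (0,1]^2$.

The boundaries between the zones of these three different rates in $\mathbb{R}_+^2$ are
presented by the dashed lines in Figure 1.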
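
The three-case definition of the rate can be evaluated numerically. The sketch below is an illustration, not part of the paper: the function name `phi` is hypothetical, and the exponents are taken as $2\gamma/(2\gamma+1+(d-1)/\beta)$, $2/(2+d/\beta)$ and $2\gamma\beta/(2\gamma\beta+d)$, which is how they appear in the remarks and zone descriptions of this section. The treatment of the boundary $\beta = d(\gamma-1)+1$ is an assumption; it does not affect the value, because the first two exponents coincide there.

```python
import math

def phi(eps, gamma, beta, d):
    """Rate phi_eps(gamma, beta) from the three-case formula (5), for 0 < eps < 1."""
    base = eps * math.sqrt(math.log(1.0 / eps))
    if gamma <= 1 and beta <= 1:                        # (gamma, beta) in (0, 1]^2: slow rate
        exponent = 2 * gamma * beta / (2 * gamma * beta + d)
    elif beta > 1 and beta >= d * (gamma - 1) + 1:      # exponent depends on both gamma and beta
        exponent = 2 * gamma / (2 * gamma + 1 + (d - 1) / beta)
    else:                                               # gamma > 1 and beta < d*(gamma - 1) + 1
        exponent = 2 / (2 + d / beta)
    return base ** exponent
```

For example, with $d = 3$, $\gamma = 1$ and $\beta = 2$ the exponent is $1/2$, whereas the classical exponent for smoothness $\alpha = 1$ in dimension 3 is $2/5$, so `phi` returns a smaller value than the purely smoothness-based rate.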

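An asymptotic lower bound for the minimax risk on $\mathbb{H}(A,\mathcal{L})$ is given by
the following theorem.

Theorem 1. For any $A = (\gamma,\beta) \in \mathbb{R}_+^2$ and any $p > 0$ we have

     $\liminf_{\varepsilon \to 0}\ \inf_{\tilde g_\varepsilon}\ \sup_{g \in \mathbb{H}(A,\mathcal{L})} E_g\big[\phi_\varepsilon^{-1}(\gamma,\beta)\,\|\tilde g_\varepsilon - g\|_\infty\big]^p > 0,$

where $\inf_{\tilde g_\varepsilon}$ denotes the infimum over all estimators of $g$.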

The theorem states that the rate of convergence $\phi_\varepsilon(\gamma,\beta)$ cannot be
improved by any estimator. We will show below that for $0 < \gamma, \beta \le 2$ there exist
estimators attaining this rate. Before proceeding to the corresponding result,
we make several remarks on the properties of the rate $\phi_\varepsilon(\gamma,\beta)$.
Fig. 1. Zones of improved rate, of slow rate and of inactive structure. Dashed lines
delimit the zones of three different expressions for the rate $\phi_\varepsilon$.

Remarks. 1. The set $\{(\gamma,\beta): \beta > \gamma,\ \beta > 1\}$ will be referred to as
the zone of improved rate (cf. Figure 1). In this zone there is an improvement
of the rate of convergence due to the structure. Indeed, if $A$ belongs to this
zone, the smoothness of function $g$ is equal to $\alpha_{\gamma,\beta} = \gamma$ (cf. Section 3.2),
and hence our rate $\phi_\varepsilon(\gamma,\beta)$ is asymptotically (as $\varepsilon \to 0$) much smaller
than the rate $\psi_{\varepsilon,d}(\alpha_{\gamma,\beta})$ obtained for the estimators that take into
account only the smoothness, and not the structure.

2. The parameter $\beta$ is the tuning parameter of the model: when the ratio
$d/\beta$ tends to 0, the rate $\phi_\varepsilon(\gamma,\beta)$, depending on the value of $\gamma$,
approaches either the one-dimensional Hölder class rate $\psi_{\varepsilon,1}(\gamma)$ or the
"almost parametric" rate $\varepsilon\sqrt{\ln(1/\varepsilon)}$. In particular, when
$\beta > \gamma > 1$ and $\beta < d(\gamma-1)+1$ the rate of convergence $\phi_\varepsilon(\gamma,\beta)$
does not depend on $\gamma$ and coincides with the minimax rate $\psi_{\varepsilon,d}(\beta)$
associated to the $d$-dimensional Hölder class $\mathbb{H}_d(\beta,\cdot)$, and in this zone the
composite function $g = f \circ G$ can be estimated with the same rate as $G$,
independently of how smooth $f$ is.

3. Theorem 1 states the lower bound
$(\varepsilon\sqrt{\ln(1/\varepsilon)})^{2\gamma/(2\gamma+1+(d-1)/\beta)}$, which is valid for all
positive $\gamma, \beta$. Inspection of its proof shows that for $d = 2$ the
lower bound is attained on the functions of the form
$f_0(\varphi_1(t_1) + \varphi_2(t_2))$. Here $f_0$ is a function of Hölder smoothness $\gamma$
and both functions $\varphi_j$, $j = 1, 2$, are of Hölder smoothness $\beta$. So, for $d = 2$
the lower bound with the rate $(\varepsilon\sqrt{\ln(1/\varepsilon)})^{2\gamma/(2\gamma+1+1/\beta)}$ holds
for this functional family for any $\gamma$ and $\beta$.

Note that when $\gamma = \beta$, this lower rate becomes
$(\varepsilon\sqrt{\ln(1/\varepsilon)})^{2\beta^2/(2\beta^2+\beta+1)}$. Since
$\frac{2\beta^2}{2\beta^2+\beta+1} < \frac{2\beta}{2\beta+1}$, it is always slower than the
classical one-dimensional rate $\varepsilon^{2\beta/(2\beta+1)}$. On the other hand, a recent
result of [10] shows that for $\gamma = \beta$ functions of the form
$f_0(\varphi_1(t_1) + \varphi_2(t_2))$ can be estimated at the rate
$\varepsilon^{2\beta/(2\beta+1)}$ in the $L_2$-norm. Thus, we observe that there is a significant
gap between the optimal rates of convergence in $L_2$ and in $L_\infty$, in contrast to the
classical nonparametric estimation problems where these rates only differ in
a logarithmic factor.

4.2. Outline of the estimation method. The exact definition of our
estimator is given in Section 5. Here we only outline its construction. We
suppose that $A = (\gamma,\beta) \in (0,2]^2$. The initial building block is a family of
linear estimators. In contrast to the classical kernel construction, which
involves a unique bandwidth parameter, the weight $K_{\mathcal{J}}$ that we consider is
determined by the triplet $\mathcal{J} = (A, \vartheta, \lambda)$ where the form parameter $A$ is the
couple $(\gamma,\beta) \in (0,2]^2$, the orientation parameter $\vartheta$ is a unit vector in
$\mathbb{R}^d$ and $\lambda$ is a positive real, which we refer to as size parameter. We denote
$\mathfrak{J}$ the set of all such triplets $\mathcal{J}$ and consider a family of linear estimators
$\{\hat g_{\mathcal{J}}, \mathcal{J} \in \mathfrak{J}\}$ where for any $x \in [-1,1]^d$ the estimator
$\hat g_{\mathcal{J}}(x)$ of $g(x)$ is given by

     $\hat g_{\mathcal{J}}(x) = \int K_{\mathcal{J}}(t - x)\,X_\varepsilon(dt).$

Note that here the size parameter $\lambda$ does not represent the bandwidth of the
classical kernel estimator, but rather characterizes the bias of the estimator
$\hat g_{\mathcal{J}}$ when the orientation of the window $\vartheta$ is correctly chosen. Namely, the
weight $K_{\mathcal{J}}$ is chosen in such a way that for each $x \in [-1,1]^d$ the bias of
$\hat g_{\mathcal{J}}$ is of the order $O(\lambda)$ if $\vartheta = \vartheta_0$ is collinear to the gradient
$\nabla G(x)$.

The estimation method proceeds in three steps, and the basic device un- 
derlying the construction of the optimal estimation method is the notion of 
the local model. It is an important feature of the composition structure that 
different local models arise in different subsets of the zone of improved rate. 

Step 1: specifying a collection of local models. The underlying function
$g$ of complicated global structure can have a simple local structure. However,
the local structure depends on the function itself. Therefore, $g$ can be
only described by a collection of local models. In our case, this collection is
indexed by a finite-dimensional parameter that can be considered as a nuisance
parameter. Specifically, we pass from the global composition model
defined in Section 3 to a family of local models
$\{\mathcal{M}_{\mathcal{J}}(x), \mathcal{J} \in \mathfrak{J}, x \in [-1,1]^d\}$ where the type of each local model
$\mathcal{M}_{\mathcal{J}}(x)$, $\mathcal{J} = (A, \vartheta, \lambda)$, is determined by $A$, while $\vartheta$ and
$\lambda$ are the local orientation and size parameters. Depending on the value of
$A = (\gamma,\beta)$ (cf. Figure 2), our global model induces only two types of local
models: a local single-index model and the model with roughness isolated to a single
dimension (local RISD model).

Fig. 2. Types of local structures.

1°. Local single-index model: $\gamma \le 1$, $1 < \beta \le 2$. In this domain of
$\gamma, \beta$, using the smoothness properties of functions $f$ and $G$, it is not hard to
show that in the ball $B_{\lambda,x}(A) = \{t \in \mathbb{R}^d: \|t - x\| \le \lambda^{1/(\gamma\beta)}\}$ the
composite function $g(\cdot)$ can be approximated with the accuracy $O(\lambda)$ by the
function $f(G(x) + \nabla G(x)^T[t - x])$, which varies only along the direction
$\vartheta = \vartheta_0$, a unit vector collinear to the gradient $\nabla G(x)$. Indeed, since the
inner function $G$ belongs to $\mathbb{H}_d(\beta, L_2)$, for any $x, t$ we have

(6)  $G(t) = G(x) + \nabla G(x)^T(t - x) + B_x(t)$  with  $|B_x(t)| \le L_2\|t - x\|^{\beta}.$

Next, using the fact that $f \in \mathbb{H}_1(\gamma, L_1)$, we conclude that $g(t) = f(G(t))$
admits the representation

     $g(t) = Q_x(t) + C_x(t),$

where

     $Q_x(t) = f(G(x) + \nabla G(x)^T(t - x))$

and

     $|C_x(t)| \le L_1|B_x(t)|^{\gamma} \le L_1 L_2^{\gamma}\|t - x\|^{\gamma\beta}.$

In other words, for any weight $K$ with the support on the ball
$B_\lambda(A) = \{t \in \mathbb{R}^d: \|t\| \le \lambda^{1/(\gamma\beta)}\}$ and such that $\int K(y)\,dy = 1$,

(7)  $\int K(t - x)\,[g(t) - Q_x(t)]\,dt = O(\lambda).$

We understand the relation (7) as the definition of the local single-index
model $Q_x$ of $g$. The choice of the approximation weight for the function $g$
is naturally suggested by the form of the local model $Q_x$ together with the
bound (7): the weight $K_{\mathcal{J}}$ can be taken as the indicator function of a
hyperrectangle normalized by its volume and oriented in such a way that $\nabla G(x)$
is collinear to the first basis vector in $\mathbb{R}^d$. The sides of the hyperrectangle
are chosen to have the lengths $l_1 = \lambda^{1/\gamma}$ and $l_j = \lambda^{1/(\gamma\beta)}$, $j = 2,\dots,d$.
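
The approximation bound behind the local single-index model can be checked numerically on a toy example. The sketch below is only an illustration under stated assumptions: the outer function $f(u) = |u|^{\gamma}$, the inner function $G(t) = t_1 + |t_2|^{\beta}/\beta$ on $\mathbb{R}^2$, the point `x` and the constants `L1 = 1`, `L2 = 2` are hypothetical choices for which the two Hölder-type inequalities used in the argument hold. The code compares $|g(t) - Q_x(t)|$ with $L_1(L_2\|t - x\|^{\beta})^{\gamma}$ at random points near `x`.

```python
import numpy as np

gamma, beta = 0.8, 1.5          # a point of the zone gamma <= 1, 1 < beta <= 2
L1, L2 = 1.0, 2.0               # constants valid for the toy f and G below (assumption)

def f(u):                       # satisfies |f(a) - f(b)| <= L1 * |a - b|^gamma
    return np.abs(u) ** gamma

def G(t):                       # toy inner function on R^2 with a (beta - 1)-Hoelder gradient
    return t[..., 0] + np.abs(t[..., 1]) ** beta / beta

def grad_G(x):
    return np.array([1.0, np.sign(x[1]) * np.abs(x[1]) ** (beta - 1)])

def local_single_index(x, t):
    """Q_x(t) = f(G(x) + grad G(x)^T (t - x))."""
    return f(G(x) + (t - x) @ grad_G(x))

rng = np.random.default_rng(0)
x = np.array([0.3, -0.2])
t = x + rng.uniform(-0.1, 0.1, size=(1000, 2))
err = np.abs(f(G(t)) - local_single_index(x, t))                      # |g(t) - Q_x(t)|
bound = L1 * (L2 * np.linalg.norm(t - x, axis=1) ** beta) ** gamma    # L1 * (L2 ||t - x||^beta)^gamma
```

In this example `err` stays below `bound` at every sampled point, so on a ball of radius $\lambda^{1/(\gamma\beta)}$ the approximation error is of order $\lambda$, as in (7).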
2°. Local model with roughness isolated to a single dimension (RISD):
$1 < \gamma \le \beta \le 2$. Let $M_\vartheta$ be an orthogonal matrix with the first column equal
to $\vartheta = \vartheta_0$, and let $y = M_\vartheta^T(t - x)$, $t \in \mathbb{R}^d$. We denote $y_j$ the $j$th
component of $y$ and consider the set

(8)  $\mathcal{X}_{\lambda,x}(A) = \{t \in \mathbb{R}^d: |y_1| \le \lambda^{1/\beta},\ \|y\| \le \lambda^{1/(\gamma\beta)},\ |y_1|^{\gamma-1}\|y\|^{\beta} \le \lambda\}.$

We show that the estimation of the composite function $g$ at $x$ can be reduced
to the problem of estimation under the local model

     $Q_x(y) = q_x(y_1) + P_x(y_2,\dots,y_d),$

where $q_x \in \mathbb{H}_1(\gamma, L_1 L_2^{\gamma})$ and $P_x \in \mathbb{H}_{d-1}(\beta, 2L_1L_2)$ on the set
$\mathcal{X}_{\lambda,x}(A)$. This local model is established in an unknown coordinate system
determined by the parameter $\vartheta = \vartheta_0$. Since the smoothness $\gamma$ of $q_x$ is smaller
than the smoothness $\beta$ of $P_x$, the accuracy of estimation that corresponds to the
coordinate $y_1$ is coarser than that for other coordinates. This motivates the
name roughness isolated to a single dimension.

The explanation of the local model represented by $Q_x$ on the set
$\mathcal{X}_{\lambda,x}(A)$ is provided by the following argument. Using the smoothness
properties of functions $f$ and $G$, we obtain due to the inclusions
$f \in \mathbb{H}_1(\gamma, L_1)$, $G \in \mathbb{H}_d(\beta, L_2)$:

     $g(t) = f(G(x) + \nabla G(x)^T(t - x)) + f'(G(x) + \nabla G(x)^T(t - x))\,B_x(t) + C_x(t)$
     $\phantom{g(t)} = f(G(x) + \nabla G(x)^T(t - x)) + f'(G(x))\,B_x(t) + D_x(t) + C_x(t),$

where

     $|C_x(t)| \le c(L_1, L_2, \gamma)\,\|t - x\|^{\gamma\beta},$

     $|D_x(t)| \le c(L_1, L_2)\,|\vartheta^T(t - x)|^{\gamma-1}\,\|t - x\|^{\beta}$,  if $\nabla G(x) \ne 0$

[we have $D_x(t) = 0$ when $\nabla G(x) = 0$], and the function $B_x(t)$, which is
defined in (6), belongs to the class $\mathbb{H}_d(\beta, 2L_2)$. In the transformed coordinates
(determined by the orthogonal matrix $M_\vartheta$) we may write

(9)  $g(t) = g(x + M_\vartheta y) = q_x(y_1) + \tilde B_x(y) + \tilde D_x(y) + \tilde C_x(y),$

where

(10)  $|\tilde D_x(y) + \tilde C_x(y)| \le c(L_1, L_2, \gamma)\,\big(|y_1|^{\gamma-1}\|y\|^{\beta} + \|y\|^{\gamma\beta}\big)$

and $\tilde B_x \in \mathbb{H}_d(\beta, 2L_2)$. The latter inclusion leads to

(11)  $\Big|\tilde B_x(y) - P_x(y_2,\dots,y_d) - y_1\frac{\partial}{\partial y_1}\tilde B_x(0, y_2,\dots,y_d)\Big| \le 2L_2|y_1|^{\beta},$

where $P_x(y_2,\dots,y_d) = \tilde B_x(0, y_2,\dots,y_d)$. Let again $K$ be a weight such that
$\int K(t)\,dt = 1$, supported on $\mathcal{X}_{\lambda,x}(A)$. Then

(12)  $\int K(t)\,[g(t) - Q_x(y)]\,dt = O(\lambda), \qquad y = M_\vartheta^T(t - x),$

if $K$ is symmetric in $y_1$. We understand this property as the definition of
the RISD local model $Q_x$ for the composite function $g$.

We conclude that if $A$ belongs to the zone marked as "RISD" in Figure 2,
the global structural assumption that the underlying function is a composite
one leads automatically to a local RISD structure.

A good weight $K_{\mathcal{J}}$ for the zone of RISD local model should be supported
on the right window $\mathcal{X}_{\lambda,x}(A)$, possess small bias on both the single-index
component $q_x$ and the "regular" component $P_x$, and have a small $L_2$-norm to ensure
small variance of the stochastic term of the estimation error. The construction
of such a weight is rather involved (cf. Section 6.2). Note that using a
rectangular weight, as for the local single-index model, leads to suboptimal
estimation rates.

As we see, the definition of a local model has two ingredients: the
neighborhood (window) and the local structure within the window. For the local
single-index model the window is just a Euclidean ball, whereas for the
RISD local model the window is the set $\mathcal{X}_{\lambda,x}(A)$.

Step 2: optimizing the size parameter and specifying candidate estimators.
Once the local model is determined and the corresponding weight is
constructed we can choose the size parameter $\lambda = \lambda_\varepsilon(A)$ in an optimal way.
To do it we optimize our sup-norm risk with respect to $\lambda$, that is, we get the
value $\lambda$ that realizes the balance of bias and variance terms of the risk in
the ideal case where the orientation $\vartheta = \vartheta_0$ is "correct" for all $x$.

Recall that the weight $K_{\mathcal{J}}$ supported on the window is chosen in such a
way that the bias of the linear estimator $\hat g_{\mathcal{J}}$, for the "correct" orientation
$\vartheta$, is of the order $O(\lambda)$ on every local model. Thus, the bias-variance balance
relation for the sup-norm loss can be written in the form

(13)  $\varepsilon\sqrt{\ln(1/\varepsilon)}\,\|K_{\mathcal{J}}\|_2 \asymp \lambda.$

We will see that $\|K_{\mathcal{J}}\|_2$ depends on $A$ and $\lambda$ but does not depend on $\vartheta$.
This will allow us to choose the optimal value $\lambda_\varepsilon(A)$ independent of $\vartheta$. For
instance, for the local single-index model (when $\gamma \le 1$) the weight $K_{\mathcal{J}}$ is
just a properly scaled and rotated indicator of a hyperrectangle. In this
particular case the bias-variance balance (13) can be written in the form

     $\varepsilon\sqrt{\ln(1/\varepsilon)}\,\lambda^{-1/(2\gamma)}\,\lambda^{-(d-1)/(2\gamma\beta)} \asymp \lambda.$
Note that in this case $\lambda_\varepsilon(A) \asymp \phi_\varepsilon(\gamma,\beta)$, where
$\phi_\varepsilon(\gamma,\beta)$ is defined in (5). With $\lambda_\varepsilon(A)$ being chosen, we obtain a
family of linear estimators

(14)  $\{\hat g_{\mathcal{J}}(x),\ \mathcal{J} = (A, \vartheta, \lambda_\varepsilon(A)) \in \mathfrak{J},\ x \in [-1,1]^d\}.$

For a fixed $x \in [-1,1]^d$ this family only depends on two parameters, $A$ and $\vartheta$.
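
The bias-variance balance for the hyperrectangle weight of the local single-index model can be verified with a few lines of arithmetic. The sketch below rests on assumptions made explicit here: the hyperrectangle has one side of length $\lambda^{1/\gamma}$ and $d - 1$ sides of length $\lambda^{1/(\gamma\beta)}$, the weight is its indicator divided by its volume (so its $L_2$ norm is the volume to the power $-1/2$), and all multiplicative constants are ignored. The function names are hypothetical.

```python
import math

def box_l2_norm(lam, gamma, beta, d):
    """L2 norm of the volume-normalised indicator of the hyperrectangle."""
    sides = [lam ** (1.0 / gamma)] + [lam ** (1.0 / (gamma * beta))] * (d - 1)
    return math.prod(sides) ** -0.5          # ||1_box / vol||_2 = vol^(-1/2)

def balanced_lambda(eps, gamma, beta, d):
    """Value of lambda solving eps * sqrt(ln(1/eps)) * ||K||_2 = lambda (constants ignored)."""
    base = eps * math.sqrt(math.log(1.0 / eps))
    return base ** (2 * gamma / (2 * gamma + 1 + (d - 1) / beta))
```

Plugging `balanced_lambda` back into the left-hand side of the balance reproduces $\lambda$, and the exponent $2\gamma/(2\gamma + 1 + (d-1)/\beta)$ is the one that appears in the lower bound of Theorem 1.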

Step 3: selection. We now choose an estimator from the family (14) that
corresponds to some $\hat{\mathcal{J}} \in \mathfrak{J}$ selected in a data-dependent way, and define
our final estimator as a piecewise-constant approximation of the function
$x \mapsto \hat g_{\hat{\mathcal{J}}}(x)$. To choose $\hat{\mathcal{J}}$ we apply the pointwise selection
procedure presented in Section 2.

We introduce a discrete grid on the unit sphere
$\{\vartheta \in \mathbb{R}^d: \|\vartheta\| = 1\}$, and we divide the domain of definition of $x$ into
small blocks. For each block, we consider a finite set of estimators
$\hat g_{\mathcal{J}}(x)$ extracted from the family (14), with $x$ fixed as the center $x_0$ of
the block and all the $\vartheta$ on the grid. We then select a data-dependent
$\hat\vartheta$ from the grid by applying our aggregation procedure to this finite set. The
value of our final estimator $g^*_{A,\varepsilon}$ on this block is constant and is defined as
$g^*_{A,\varepsilon}(x) = \hat g_{(A, \hat\vartheta, \lambda_\varepsilon(A))}(x_0)$. We thus get a
piecewise-constant estimator $g^*_{A,\varepsilon}$ on $[-1,1]^d$ that depends only on $A$ and
on the observations (the exact definition of $g^*_{A,\varepsilon}$ is given in Section 5).

Remarks. In this paper we assume that the smoothness $A = (\gamma,\beta)$ is
known, and we deal only with adaptation to the local structure determined
by $\vartheta$. If $A$ is unknown we need simultaneous adjustment of the
estimators to $A$ and to $\vartheta$, that is, to the smoothness and to the
local structure of the underlying function. Note, however, that the parameters
$A$ and $\vartheta$ are not independent. In particular, $A$ determines the form
of the neighborhood where we have an unknown local structure depending on
$\vartheta$. This is important because our construction of the family of
estimators $\{\hat g_{\mathcal J}, \mathcal J \in \mathfrak J\}$ strongly relies
on the local representation of the model. For example, if this family does not
contain an estimator corresponding to the correct local structure, the choice
from this family cannot even guarantee consistency. Another difficulty is that
different values of $A$ can correspond to different types of local models
(cf. Figure 2). In other words, the problem of adaptive estimation of composite
functions turns out to be more involved than the classical adaptation to the
unknown smoothness as considered, for example, in [16, 17, 18]. As yet we do
not know whether fully adaptive estimation in this context is possible.

Fig. 3. Classification of zones within $(0,2]^2$.

4.3. Upper bounds on the risk of the estimators. We define the following
three domains of values of $A = (\gamma,\beta)$ contained in $(0,2]^2$
(cf. Figure 3):

        $\mathcal P_1 = \{A : \gamma \le 1,\ 1 < \beta \le 2\},$

(15)    $\mathcal P_2 = \{A : 1 < \gamma \le \beta \le 2,\ \beta \ge d(\gamma-1)+1\},$

        $\mathcal P_3 = \{A : 1 < \gamma \le \beta \le 2,\ \beta < d(\gamma-1)+1\}.$

In view of the above discussion, these are exactly the zones where improved
rates occur and where the local structure is active. For the sake of
completeness, we also consider the remainder zone (zone of no local structure):

        $\mathcal P_4 = (0,1]^2 \cup \{(\gamma,\beta) : 1 < \beta < \gamma \le 2\}.$

As we will see in Section 6.2, the optimal weights $K_{\mathcal J}$ are defined
separately for each of these zones.

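Theorem 2. Let $\varphi_\varepsilon(\gamma,\beta)$ be as in (5). For any
$A = (\gamma,\beta) \in (0,2]^2 \setminus \mathcal P_2$ and any $p > 0$ the
estimator $g^*_{A,\varepsilon}$ satisfies

        $\limsup_{\varepsilon\to 0}\ \sup_{g\in\mathbb H(A,\mathcal L)} E_g\bigl[\bigl(\varphi_\varepsilon^{-1}(\gamma,\beta)\,\|g^*_{A,\varepsilon}-g\|_\infty\bigr)^p\bigr] < \infty.$

For any $A = (\gamma,\beta) \in \mathcal P_2$ and any $p > 0$ the estimator
$g^*_{A,\varepsilon}$ satisfies

        $\limsup_{\varepsilon\to 0}\ \sup_{g\in\mathbb H(A,\mathcal L)} E_g\bigl[\bigl([\ln\ln(1/\varepsilon)]^{-1}\varphi_\varepsilon^{-1}(\gamma,\beta)\,\|g^*_{A,\varepsilon}-g\|_\infty\bigr)^p\bigr] < \infty.$

Combining Theorems 1 and 2 we conclude that $\varphi_\varepsilon(\gamma,\beta)$
is the minimax rate of convergence for the class $\mathbb H(A,\mathcal L)$ if
$A = (\gamma,\beta) \in (0,2]^2 \setminus \mathcal P_2$, and that it is near
minimax [up to the $\ln\ln(1/\varepsilon)$ factor] if
$A = (\gamma,\beta) \in \mathcal P_2$. Therefore, our estimator
$g^*_{A,\varepsilon}$ is, respectively, rate optimal or near rate optimal on
$\mathbb H(A,\mathcal L)$.

Theorem 2 is in fact a result on adaptation to the unknown local structure
of the function to be estimated: the estimator $g^*_{A,\varepsilon}$ locally
adapts to the "correct" orientation $\vartheta_0$, which is collinear to the
gradient $\nabla G(x)$ at $x$.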

Remarks. We consider here the Gaussian white noise model because its
analysis requires a minimum of technicalities. Composition structures can
be studied for more realistic models, such as nonparametric regression with
random design, nonparametric density estimation and classification. Note
that our theorems can be directly transposed to the Gaussian nonparametric
regression model with fixed equidistant design using the equivalence of
experiments argument (cf. [24]). Note also that results similar to ours have
recently been obtained for the problem of testing hypotheses about composite
functions in the Gaussian white noise model [20].

We prove the upper bound of Theorem 2 only for the case $A \in (0,2]^2$.
An extension to $A \notin (0,2]^2$ remains an open problem. On the other hand,
the lower bound of Theorem 1 is valid for all $A \in \mathbb R^2_+$. We believe
that it cannot be improved. This conjecture is supported by the recent results
on a hypothesis testing problem with composite functions [20], which is closely
related to our estimation problem. The upper bound proved in [20] for all
$A \in \mathbb R^2_+$ in the problem of hypothesis testing coincides with the
lower bound of Theorem 1.

The rate of convergence of the minimax procedure (cf. Theorem 2) in the
zone $\mathcal P_2$ contains an additional $\ln\ln(1/\varepsilon)$ factor, as
compared to the lower bound of Theorem 1. We believe that this minor
deterioration of the rate can be avoided by using a more refined estimation
procedure.

5. Definition of the estimator and basic approximation results. We first
introduce some notation. For a bounded function $K \in L_1(\mathbb R^d)$ and
$p \ge 1$ we denote by $\|K\|_p$ its $L_p$-norm and by $K * g$ its convolution
with a bounded function $g$:

        $[K * g](x) = \int K(t-x)\,g(t)\,dt, \qquad x \in \mathbb R^d$

(here and in the sequel $\int = \int_{\mathbb R^d}$). We denote
$\mathcal J = (A,\vartheta,\lambda)$ where $A = (\gamma,\beta) \in (0,2]^2$,
$\vartheta$ is a unit vector in $\mathbb R^d$ and $\lambda > 0$. The class of
all such triplets is denoted by $\mathfrak J$.

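Given a unit vector $\vartheta$ let $M_\vartheta \in \mathbb R^{d\times d}$
stand for an orthogonal matrix with the first column equal to $\vartheta$. The
weight system we consider in the sequel is defined as

        $K_{\mathcal J}(x) = K_{(A,\lambda)}(M_\vartheta^{T} x),$

where $K_{(A,\lambda)} : \mathbb R^d \to \mathbb R$ is a weight that will be
defined in Section 6. Next, for any $\mathcal J, \mathcal J' \in \mathfrak J$
and all $t \in \mathbb R^d$ we define the convoluted weight

        $K_{\mathcal J' * \mathcal J}(t) = \int K_{\mathcal J'}(t-y)\,K_{\mathcal J}(y)\,dy$

and the difference

        $\Delta_{\mathcal J'} K_{\mathcal J' * \mathcal J} = K_{\mathcal J' * \mathcal J} - K_{\mathcal J'}.$

We require the weight $K_{\mathcal J}$ to be symmetric, that is,
$K_{\mathcal J}(t) = K_{\mathcal J}(-t)$, and

(16)    $K_{\mathcal J' * \mathcal J} = K_{\mathcal J * \mathcal J'}.$

For all $\mathcal J \in \mathfrak J$ and all $x \in [-1,1]^d$ set

        $\hat g_{\mathcal J}(x) = \int_{\mathcal D} K_{\mathcal J}(t-x)\,X_\varepsilon(dt)$

and for all $\mathcal J', \mathcal J \in \mathfrak J$ define the convoluted
estimator

        $\hat g_{\mathcal J' * \mathcal J}(x) = \int_{\mathcal D} K_{\mathcal J' * \mathcal J}(t-x)\,X_\varepsilon(dt).$

In what follows we assume $\varepsilon$ is small enough so that in all
expressions that involve weight convolutions we can replace
$\int_{\mathcal D}$ by $\int_{\mathbb R^d}$ (recall that the weights we consider
are compactly supported). We also suppose that $\ln\ln(1/\varepsilon) > 0$.
Define

        $\Delta_{\mathcal J'}\hat g_{\mathcal J' * \mathcal J}(x) = \hat g_{\mathcal J' * \mathcal J}(x) - \hat g_{\mathcal J'}(x)$

and set

        $TH_\varepsilon(\mathcal J',\mathcal J) = C(p,d)\,\bigl(\|K_{\mathcal J'}\|_1 + \|K_{\mathcal J}\|_1\bigr)\,\|K_{\mathcal J'}\|_2\;\varepsilon\sqrt{\ln(1/\varepsilon)},$

where $C(p,d) = 2 + \sqrt{4p + 8d}$.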

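5.1. Estimation procedure. Now we need to introduce a discrete grid on
the set of indices $\mathfrak J$. We discretize only the
$\vartheta$-coordinate of $\mathcal J$. Recall that $\vartheta$ takes values on
the Euclidean unit sphere $\mathbb S$ in $\mathbb R^d$.

Discretization. Let $\mathbb S_\varepsilon \subset \mathbb S$ be an
$\varepsilon$-net on $\mathbb S$, that is, a finite set such that

        $\forall \vartheta \in \mathbb S\ \ \exists \vartheta' \in \mathbb S_\varepsilon :\ \|\vartheta - \vartheta'\| \le \varepsilon$

and $\mathrm{card}(\mathbb S_\varepsilon) \le (\sqrt d/\varepsilon)^{d-1}$ for
small $\varepsilon$. Without loss of generality, we will assume that
$(1,0,\dots,0) \in \mathbb S_\varepsilon$.

Fix $A \in (0,2]^2$ and define $\lambda_\varepsilon(A)$ as a solution in
$\lambda$ of the bias-variance balance equation

(17)    $C_1\lambda = \varepsilon\sqrt{\ln(1/\varepsilon)}\,\|K_{(A,\lambda)}\|_2,$

where $C_1$ is the constant in Proposition 2 below, depending only on $A$,
$\mathcal L$ and $d$. Finally we define the grid on $\mathfrak J$:

        $\mathfrak J_{\rm grid} = \{\mathcal J = (A,\vartheta,\lambda_\varepsilon(A)) : \vartheta \in \mathbb S_\varepsilon\} \subset \mathfrak J.$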
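
To make the balance equation (17) concrete, here is a minimal Python sketch
that solves it in closed form for the rectangular weight of the local
single-index zone, using the $L_2$-norm stated in Lemma 2(ii). The value
`c1 = 1.0` is a placeholder for the constant $C_1$ (an assumption; the paper
does not give a numerical value), and the function names are illustrative.

```python
import math

def l2_norm_rect(lam, gamma, beta, d):
    # L2-norm of the rectangular weight, as stated in Lemma 2(ii):
    # (2^d * lam^(1/gamma + (d-1)/(gamma*beta)))^(-1/2)
    s = 1.0 / gamma + (d - 1) / (gamma * beta)
    return (2.0 ** d * lam ** s) ** -0.5

def solve_balance(eps, gamma, beta, d, c1=1.0):
    # Closed-form root of c1 * lam = eps * sqrt(ln(1/eps)) * l2_norm_rect(lam).
    # c1 stands in for the constant C_1 (assumed value).
    s = 1.0 / gamma + (d - 1) / (gamma * beta)
    rhs = eps * math.sqrt(math.log(1.0 / eps)) * 2.0 ** (-d / 2.0) / c1
    return rhs ** (1.0 / (1.0 + s / 2.0))

eps, gamma, beta, d = 1e-3, 0.8, 1.5, 2
lam = solve_balance(eps, gamma, beta, d)
left = lam  # left-hand side of (17) with c1 = 1
right = eps * math.sqrt(math.log(1.0 / eps)) * l2_norm_rect(lam, gamma, beta, d)
# exponent of eps*sqrt(ln(1/eps)) in the solution
rate_exponent = 2 * gamma / (2 * gamma + 1 + (d - 1) / beta)
```

With these inputs `left` and `right` agree up to rounding, and the power of
$\varepsilon\sqrt{\ln(1/\varepsilon)}$ in the solution equals `rate_exponent`.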
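Acceptability. For a given $x \in [-1,1]^d$ we define a subset
$\hat{\mathcal T}_x$ of $\mathfrak J_{\rm grid}$ as follows:

        $\mathcal J \in \hat{\mathcal T}_x \iff |\Delta_{\mathcal J'}\hat g_{\mathcal J' * \mathcal J}(x)| \le TH_\varepsilon(\mathcal J',\mathcal J) \quad \forall \mathcal J' \in \mathfrak J_{\rm grid}.$

Any value $\mathcal J \in \mathfrak J_{\rm grid}$ that belongs to
$\hat{\mathcal T}_x$ is called acceptable.

Note that the threshold $TH_\varepsilon(\mathcal J',\mathcal J)$ can be bounded
from above and replaced in all the definitions by a value that does not depend
on $\mathcal J, \mathcal J' \in \mathfrak J_{\rm grid}$. In fact, either
$TH_\varepsilon(\mathcal J',\mathcal J) \asymp \lambda_\varepsilon(A)$ if
$A \in \mathcal P_1 \cup \mathcal P_3 \cup \mathcal P_4$, or
$TH_\varepsilon(\mathcal J',\mathcal J) \asymp \ln\ln(1/\varepsilon)\,\lambda_\varepsilon(A)$
if $A \in \mathcal P_2$.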
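
The threshold is a simple function of the weight norms. The sketch below
evaluates it as read from the definition of $TH_\varepsilon$ in Section 5; the
function name and the numerical inputs are illustrative assumptions, not part
of the original procedure.

```python
import math

def threshold(l1_jp, l1_j, l2_jp, eps, p, d):
    # TH_eps(J', J) = C(p,d) * (||K_J'||_1 + ||K_J||_1) * ||K_J'||_2
    #                 * eps * sqrt(ln(1/eps)),  with C(p,d) = 2 + sqrt(4p + 8d)
    c_pd = 2.0 + math.sqrt(4.0 * p + 8.0 * d)
    return c_pd * (l1_jp + l1_j) * l2_jp * eps * math.sqrt(math.log(1.0 / eps))

# Illustrative values: both weights have L1-norm 1 (rectangular weights),
# and the L2-norm of K_J' is set to 25.0 purely as an example.
th = threshold(l1_jp=1.0, l1_j=1.0, l2_jp=25.0, eps=1e-3, p=2, d=2)
```

On the grid all weights share the same norms, which is why the threshold can
be replaced by a single value, as noted above.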

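Estimation at a fixed point. For any $x \in [-1,1]^d$ such that
$\hat{\mathcal T}_x \ne \emptyset$ we select an arbitrary $\hat{\mathcal J}_x$
from the set $\hat{\mathcal T}_x$. Note that the set $\hat{\mathcal T}_x$ is
finite, so a measurable choice of $\hat{\mathcal J}_x$ is always possible; we
assume that such a choice is effectively done. We then define the estimator
$g^{**}(x)$ as follows:

(18)    $g^{**}(x) = \begin{cases} \hat g_{\hat{\mathcal J}_x}(x), & \text{if } \hat{\mathcal T}_x \ne \emptyset,\\ 0, & \text{if } \hat{\mathcal T}_x = \emptyset. \end{cases}$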
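
The acceptability rule and the pointwise selection can be sketched in a few
lines of Python. Everything here is a toy illustration under stated
assumptions: `diff[jp][j]` stands for the statistic
$\Delta_{\mathcal J'}\hat g_{\mathcal J' * \mathcal J}(x)$, `thr[jp][j]` for the
threshold, the numbers are made up, and the fallback value `0.0` for an empty
acceptable set is an assumption of this sketch.

```python
def acceptable_set(grid, diff, thr):
    # j is acceptable if |diff[jp][j]| <= thr[jp][j] for every jp on the grid
    return [j for j in grid
            if all(abs(diff[jp][j]) <= thr[jp][j] for jp in grid)]

def pointwise_estimate(grid, estimates, diff, thr, fallback=0.0):
    # Pick any acceptable index (here: the first one); fallback if none exists.
    acc = acceptable_set(grid, diff, thr)
    if not acc:
        return fallback
    return estimates[acc[0]]

grid = [0, 1, 2]
estimates = [1.5, 1.7, 9.0]          # hypothetical values of g_J(x) on the grid
diff = [[0.0, 0.2, 3.0],             # hypothetical comparison statistics
        [0.1, 0.0, 2.5],
        [0.3, 0.4, 0.0]]
thr = [[1.0] * 3 for _ in range(3)]  # constant threshold, for illustration

acc = acceptable_set(grid, diff, thr)
value = pointwise_estimate(grid, estimates, diff, thr)
```

Index 2 is rejected because its comparison with the other two grid points
exceeds the threshold, so the estimate is taken from an acceptable index.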



Global estimator. The estimator $g^{**}$ is defined for all
$x \in [-1,1]^d$ and we could consider $x \mapsto g^{**}(x)$,
$x \in [-1,1]^d$, as an estimator of the function $g$. However, the
measurability of this mapping is not a straightforward issue. To skip the
analysis of measurability, we again use a discretization. Introduce the
following cubes in $\mathbb R^d$:

        $\Pi_\varepsilon(z) = \bigotimes_{k=1}^{d}\bigl(\varepsilon^2(z_k - 1),\ \varepsilon^2 z_k\bigr], \qquad z = (z_1,\dots,z_d) \in \mathbb Z^d.$

For any $x \in [-1,1]^d$ we consider $z(x) \in \mathbb Z^d$ such that $x$
belongs to the cube $\Pi_\varepsilon(z(x))$, and a piecewise constant estimator
$g^{**}(z(x))$. Our final estimator is a truncated version of $g^{**}(z(x))$:

(19)    $g^*_{A,\varepsilon}(x) = \begin{cases} g^{**}(z(x)), & \text{if } |g^{**}(z(x))| \le \ln\ln(1/\varepsilon),\\ \ln\ln(1/\varepsilon)\,\mathrm{sign}(g^{**}(z(x))), & \text{if } |g^{**}(z(x))| > \ln\ln(1/\varepsilon). \end{cases}$

Thus, the resulting procedure $g^*_{A,\varepsilon}$ is piecewise constant on
the cubes $\Pi_\varepsilon(z) \subset [-1,1]^d$, $z \in \mathbb Z^d$.
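
The discretization and truncation steps are easy to state in code. The
following Python sketch (function names are illustrative) computes the index
$z(x)$ of the cube of side $\varepsilon^2$ containing $x$ and truncates a value
at level $\ln\ln(1/\varepsilon)$; it assumes $\varepsilon < 1/e$ so that the
truncation level is positive.

```python
import math

def cube_index(x, eps):
    # z with eps^2 * (z_k - 1) < x_k <= eps^2 * z_k for every coordinate k
    return tuple(math.ceil(xk / eps ** 2) for xk in x)

def truncate(value, eps):
    # Clip value to [-lnln(1/eps), lnln(1/eps)]; requires eps < 1/e
    bound = math.log(math.log(1.0 / eps))
    if abs(value) <= bound:
        return value
    return math.copysign(bound, value)

eps = 0.1
x = (0.503, -0.2517)
z = cube_index(x, eps)
clipped = truncate(10.0, 0.01)
```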

Remark. Some comments on the numerical complexity of the proposed
method are in order here. The algorithm of this section can be easily
reformulated for the problem of estimation of the signal $g(i)$ at $n$ points
of a regular grid in $[0,1]^d$, from independent observations
$y(i) = g(i) + \xi(i)$, $\xi(i) \sim \mathcal N(0,1)$, $i = 1,\dots,n$. A
standard argument results in the equivalence between the two models when
$\varepsilon \asymp n^{-1/2}$ [24].

According to the definition of our method, at each point we need to compare
$N = O(n^{(d-1)/2})$ estimators which correspond to the grid on the unit sphere
of dimension $d-1$. There are two main components of the numerical effort: we
need to compute $N^2$ convoluted weights and the convolutions of these weights
with the observation $y$. It will cost $O(n)$ elementary operations to
implement the construction of Section 6.2 for each of the $N$ weights, and then
$O(n \ln n)$ operations to compute each of the $N^2$ convolutions. The
numerical complexity of this step is therefore
$O(N^2 n \ln n) = O(n^d \ln n)$. Further, the convolution of $y$ with each
weight requires $O(n \ln n)$ operations. Thus the total cost of convoluting all
$N^2$ weights with $y$ will be, again, $O(n^d \ln n)$. Finally, choosing the
estimator from the family at each point of the grid demands $N^2$ comparisons.
We conclude that the total effort will be $O(n^d \ln n)$ elementary operations,
which is far from being prohibitive for dimensions $d = 2$ and $d = 3$ that are
of interest in the context of image analysis.
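
The orders of magnitude in this remark can be checked numerically. The Python
sketch below takes $\varepsilon = n^{-1/2}$ and uses the cardinality bound
$(\sqrt d/\varepsilon)^{d-1}$ of the $\varepsilon$-net as a stand-in for $N$
(an assumption: the actual net may be smaller); all constants hidden in the
$O(\cdot)$ notation are set to one.

```python
import math

def operation_counts(n, d):
    eps = n ** -0.5
    # Upper bound on the number of orientations on the eps-net
    n_orient = (math.sqrt(d) / eps) ** (d - 1)
    # N^2 convolutions, each costing about n*ln(n) operations
    conv_cost = n_orient ** 2 * n * math.log(n)
    # N^2 comparisons at each of the n grid points
    compare_cost = n_orient ** 2 * n
    return n_orient, conv_cost, compare_cost

n, d = 10_000, 2
n_orient, conv_cost, compare_cost = operation_counts(n, d)
# conv_cost / (n^d * ln n) is the constant d^(d-1), independent of n
ratio = conv_cost / (n ** d * math.log(n))
```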

5.2. Basic approximation results. We can now describe the approximation
properties of the weight $K_{\mathcal J}$, which serve as a main tool in the
proof of the properties of the estimator $g^*_{A,\varepsilon}(x)$.

Let $x \in [-1,1]^d$ and $A = (\gamma,\beta) \in (0,2]^2$ be fixed and let
$g = f \circ G \in \mathbb H(A,\mathcal L)$. We define

(20)    $\vartheta_0 = \begin{cases} (1,0,\dots,0), & \text{if } \beta > 1 \text{ and } \nabla G(x) = 0, \text{ or } \beta \le 1,\\ \nabla G(x)/\|\nabla G(x)\|, & \text{if } \beta > 1,\ \nabla G(x) \ne 0. \end{cases}$

The following statement is an immediate consequence of Lemmas 1-4 formulated
in the next section:

Proposition 1. For all $A = (\gamma,\beta) \in (0,2]^2$ and all
$\lambda > 0$ we have

        $\sup_{x\in[-1,1]^d}\ \sup_{g\in\mathbb H(A,\mathcal L)} \bigl|[K_{\mathcal J_0} * g](x) - g(x)\bigr| \le C_2\lambda,$

where $\mathcal J_0 = (A,\vartheta_0,\lambda)$ and $C_2$ only depends on $A$,
$\mathcal L$ and $d$.

In other words, the weight system $\{K_{\mathcal J}, \mathcal J \in \mathfrak J\}$
contains an element $K_{\mathcal J_0}$ such that the quality of approximation
of $g(x)$ by the "ideal" smoother $[K_{\mathcal J_0} * g](x)$ is of the order
$O(\lambda)$. Here we use the term "ideal" because
$\mathcal J_0 = (A,\vartheta_0,\lambda)$ depends on the gradient
$\nabla G(x)$, and thus on the unknown function $g$.

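The following property of the weights $K_{\mathcal J}$ is used in the proof of
Theorem 2.

Proposition 2. For all $A = (\gamma,\beta) \in (0,2]^2$, $x \in [-1,1]^d$,
$0 < \lambda < 1$ and all $\mathcal J' = (A,\vartheta',\lambda) \in \mathfrak J$
we have

(21)    $\sup_{g\in\mathbb H(A,\mathcal L)} \bigl|[\Delta_{\mathcal J'}K_{\mathcal J' * \mathcal J_0^\varepsilon} * g](x)\bigr| \le C_1\bigl\{(\|K_{\mathcal J'}\|_1 + \|K_{\mathcal J_0^\varepsilon}\|_1)\lambda + \|K_{\mathcal J'}\|_1\|K_{\mathcal J_0^\varepsilon}\|_1\,\varepsilon\bigr\},$

where $\mathcal J_0^\varepsilon = (A,\vartheta_0^\varepsilon,\lambda)$,
$\vartheta_0^\varepsilon$ is any element of the unit sphere $\mathbb S$ such
that $\|\vartheta_0^\varepsilon - \vartheta_0\| \le \varepsilon$, and $C_1$ is a
constant depending only on $A$, $\mathcal L$ and $d$. Furthermore, for any
$\mathcal J, \mathcal J'$ we have

(22)    $\|\Delta_{\mathcal J'}K_{\mathcal J' * \mathcal J}\|_2 \le (\|K_{\mathcal J'}\|_1 + \|K_{\mathcal J}\|_1)\,\|K_{\mathcal J'}\|_2.$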

6. Weight systems and properties of the weights. Depending on the
value of $A$ [different zones $\mathcal P_i$ (cf. Figure 3)] we use different
constructions of $K_{(A,\lambda)}$. Our objective is to obtain
$K_{\mathcal J}$ with suitable approximation properties for each
$\mathcal J \in \mathfrak J$. Let us summarize here the main requirements on
the weight:

1. Convolution of the weight $K_{(A,\lambda)}$ with the "local model" of $g$
   corresponding to $A$ should approximate $g$ with the accuracy $O(\lambda)$.
   Furthermore, the weight should be localized, that is, it should vanish
   outside of the window where the local structure is valid.

2. A basic characteristic of the weight is its $L_2$-norm, which determines
   the variance of the estimator. Our objective is to achieve its minimal
   value.

3. The $L_1$-norm of the weights is also an important parameter of the
   proposed estimation procedure since it is inherent to the definition of
   the threshold. Our objective will be to keep the $L_1$-norm as small as
   possible.

We start with the formulation of the properties of the weights, which allows
us to prove the basic approximation result and to find the parameters of
our estimation procedure. The explicit description of the weight systems will
be given at the end of the section.

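6.1. Properties of the weights.

Zone $\mathcal P_4$ (no local structure).

Lemma 1. For any $A = (\gamma,\beta) \in \mathcal P_4$, $\lambda > 0$ and
$x \in [-1,1]^d$, we have

        $\sup_{g\in\mathbb H(A,\mathcal L)} \bigl|[K_{(A,\lambda)} * g](x) - g(x)\bigr| \le c_0\lambda,$

where the constant $c_0$ depends only on $\mathcal L$ and $d$. Furthermore,

        $\|K_{(A,\lambda)}\|_1 = 1 \quad\text{and}\quad \|K_{(A,\lambda)}\|_2 = \begin{cases} (2\lambda^{1/(\gamma\beta)})^{-d/2}, & (\gamma,\beta) \in (0,1]^2,\\ (2\lambda^{1/\beta})^{-d/2}, & 1 < \beta < \gamma \le 2. \end{cases}$

Zone $\mathcal P_1$ (local single-index model). Let
$q : \mathbb R \to \mathbb R$ and $B : \mathbb R^d \to \mathbb R$ be functions
such that, for given $\gamma \in (0,1]$,

        $|q(x) - q(y)| \le L|x - y|^\gamma \quad \forall x,y \in \mathbb R,$
        $\sup_{x\in\mathbb R^d} |B(x)| \le c_1,$

where $c_1 > 0$, $L > 0$ are constants. We denote by $\mathfrak A(\gamma)$ the
set of all pairs of functions $(q,B)$ satisfying these restrictions. Define

        $Q(y) = q(y_1) + B(y)\,\|y\|^{\gamma\beta} \quad \forall y \in \mathbb R^d.$

We have the following evident result: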

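Lemma 2. For any $A = (\gamma,\beta) \in \mathcal P_1$ and $\lambda > 0$ we
have

(i)     $\sup_{(q,B)\in\mathfrak A(\gamma)} \bigl|[K_{(A,\lambda)} * Q](0) - Q(0)\bigr| \le c_2\lambda,$

where $c_2$ is a constant depending only on $L$, $c_1$ and $d$. Moreover,

(ii)    $\|K_{(A,\lambda)}\|_1 = 1 \quad\text{and}\quad \|K_{(A,\lambda)}\|_2 = \bigl(2^d\lambda^{1/\gamma+(d-1)/(\gamma\beta)}\bigr)^{-1/2}.$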

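Zone $\mathcal P_2 \cup \mathcal P_3$ (RISD local model). Let
$q : \mathbb R \to \mathbb R$ and $p : \mathbb R^d \to \mathbb R$,
$B : \mathbb R^d \to \mathbb R$ be functions such that $p$ is continuously
differentiable and, for given
$A = (\gamma,\beta) \in \mathcal P_2 \cup \mathcal P_3$ and $\lambda > 0$,

(23)    $\Bigl|\frac{1}{2\lambda^{1/\gamma}}\int_{-\lambda^{1/\gamma}}^{\lambda^{1/\gamma}} q(y_1)\,dy_1 - q(0)\Bigr| \le c_3\lambda,$

(24)    $|p(z') - p(z) - [\nabla p(z)]^T(z' - z)| \le L\|z' - z\|^\beta \quad \forall z,z' \in \mathbb R^d,$

(25)    $\sup_{x\in\mathbb R^d} |B(x)| \le c_4,$

where $c_3$, $c_4$ and $L$ are positive constants. Let $\mathfrak B(A,\lambda)$
denote the set of triplets $(q,p,B)$ satisfying (23)-(25). Define

        $Q(y) = q(y_1) + p(y) + B(y)\,|y_1|^{\gamma-1}\|y\|^\beta \quad \forall y \in \mathbb R^d.$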

Lemma 3. Let $A = (\gamma,\beta) \in \mathcal P_3$. Then, for any
$\lambda > 0$ small enough,

(26)    $\sup_{(q,p,B)\in\mathfrak B(A,\lambda)} \bigl|[K_{(A,\lambda)} * Q](0) - Q(0)\bigr| \le c\lambda,$

(27)    $\int |K_{(A,\lambda)}(y)|\,\|y\|^m\,dy \le c'\lambda^{m/(\gamma\beta)} \quad \forall m \in \mathbb R_+,$

where the constant $c$ depends only on $c_3$, $c_4$, $L$, $d$ and $A$, and $c'$
depends only on $m$, $d$ and $A$. Furthermore,

(28)    $\|K_{(A,\lambda)}\|_1 \le c'' \quad\text{and}\quad \|K_{(A,\lambda)}\|_2 \le c^{(3)}\lambda^{-d/(2\beta)},$

where the constants $c''$ and $c^{(3)}$ only depend on $A$ and $d$.



The weight $K_{(A,\lambda)}$ depends on $A = (\gamma,\beta)$ in such a way that
the constants in the bounds (26)-(28) diverge when $A$ approaches the boundary
$d(\gamma-1)+1 = \beta$ of the zone $\mathcal P_3$. So, Lemma 3 cannot be
extended to $A \in \mathcal P_2$.

We consider now another construction that provides the weight
$K_{(A,\lambda)}$ with properties similar to those of Lemma 3 but satisfied for
all $A \in \mathcal P_2 \cup \mathcal P_3$ and, what is more, uniformly over
this set. The price to pay for the uniformity is an extra
$\log\log(1/\lambda)$ factor in the bound for the $L_1$-norm of
$K_{(A,\lambda)}$.

Lemma 4. Let $A = (\gamma,\beta) \in \mathcal P_2 \cup \mathcal P_3$. Then, for
any $\lambda > 0$ small enough,

(29)    $\sup_{(q,p,B)\in\mathfrak B(A,\lambda)} \bigl|[K_{(A,\lambda)} * Q](0) - Q(0)\bigr| \le c_5\lambda,$

(30)    $\int |K_{(A,\lambda)}(y)|\,\|y\|^m\,dy \le c_6\lambda^{m/(\gamma\beta)} \quad \forall m \in \mathbb R_+,$

where the constant $c_5$ depends only on $c_3$, $c_4$, $L$ and $d$, and
$c_6 > 0$ depends only on $m$ and $d$ (both constants are explicit in the proof
of the lemma). Furthermore,

(31)    $\|K_{(A,\lambda)}\|_1 \le c_7\ln\ln\lambda^{-1} \quad\text{and}\quad \|K_{(A,\lambda)}\|_2 \le c_8\lambda^{-(\beta+d-1)/(2\gamma\beta)},$

where the constants $c_7$ and $c_8$ only depend on $d$.



6.2. Weight systems. 



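Weight system for zone $\mathcal P_4$ (no local structure). The construction
of $K_{(A,\lambda)}$ is trivial when $A$ is in the zone $\mathcal P_4$ of no
local structure. In this case a basic boxcar kernel tuned to the smoothness of
the composite function can be used. Observe that when $A \in (0,1]^2$ the
smoothness of the composite function equals $\gamma\beta$, and when
$A = (\gamma,\beta)$ satisfies $1 < \beta < \gamma \le 2$ the smoothness is
$\beta$. So, we define the weight $K_{(A,\lambda)}$ for the zone
$\mathcal P_4$ as follows:

        $K_{(A,\lambda)}(y) = \begin{cases} (2\lambda^{1/(\gamma\beta)})^{-d}\,\mathbb 1_{[-\lambda^{1/(\gamma\beta)},\,\lambda^{1/(\gamma\beta)}]^d}(y), & \text{if } A = (\gamma,\beta) \in (0,1]^2,\\ (2\lambda^{1/\beta})^{-d}\,\mathbb 1_{[-\lambda^{1/\beta},\,\lambda^{1/\beta}]^d}(y), & \text{if } 1 < \beta < \gamma \le 2. \end{cases}$

Here $\mathbb 1_A(\cdot)$ stands for the indicator function of a set $A$. The
proof of Lemma 1 is straightforward.

Weight system for zone $\mathcal P_1$ (local single-index model). The zone of
the local single-index model is
$\mathcal P_1 = \{A = (\gamma,\beta) : \gamma \le 1,\ 1 < \beta \le 2\}$. For
any $A \in \mathcal P_1$ and $\lambda > 0$ consider the hyperrectangle

        $\Pi_\lambda(A) = [-\lambda^{1/\gamma},\lambda^{1/\gamma}] \times [-\lambda^{1/(\gamma\beta)},\lambda^{1/(\gamma\beta)}]^{d-1}$

and define the weight $K_{(A,\lambda)}$ as follows:

(32)    $K_{(A,\lambda)}(y) = \bigl(2^d\lambda^{1/\gamma+(d-1)/(\gamma\beta)}\bigr)^{-1}\,\mathbb 1_{\Pi_\lambda(A)}(y), \qquad y \in \mathbb R^d.$

The proof of Lemma 2 is evident.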

Weight system for zone $\mathcal P_2 \cup \mathcal P_3$ (RISD local model). The
zone of the RISD local model is
$\mathcal P_2 \cup \mathcal P_3 = \{A = (\gamma,\beta) : 1 < \gamma \le \beta \le 2\}$.
The definition of the weight in this case is more involved. Indeed, taking
$K_{(A,\lambda)}$ as a simple product of boxcar kernels (32) results, for
$A \in \mathcal P_2 \cup \mathcal P_3$, in too large an approximation error.

Our aim is to construct a weight $K_{(A,\lambda)} : \mathbb R^d \to \mathbb R$
with the following properties:

- for some $c > 0$, it should vanish outside the set [cf. (8)]

        $\{y \in \mathbb R^d : |y_1| \le c\lambda^{1/\beta},\ \|y\| \le c\lambda^{1/(\gamma\beta)},\ |y_1|^{\gamma-1}\|y\|^\beta \le c\lambda\};$

- for a function $q(y_1)$ of the first component $y_1$ of
  $y \in \mathbb R^d$, the "characteristic size" of $K_{(A,\lambda)}$ should be
  $\lambda^{1/\gamma}$; for a function $Q(y_2,\dots,y_d)$ of the remaining
  components $y_2,\dots,y_d$ it should be $\lambda^{1/\beta}$. Namely, we want
  to ensure the relations

        $\int K_{(A,\lambda)}(y)\,q(y_1)\,dy = (2\lambda^{1/\gamma})^{-1}\int_{-\lambda^{1/\gamma}}^{\lambda^{1/\gamma}} q(y_1)\,dy_1$

  and

        $\int K_{(A,\lambda)}(y)\,Q(y_2,\dots,y_d)\,dy = (2\lambda^{1/\beta})^{-(d-1)}\int_{-\lambda^{1/\beta}}^{\lambda^{1/\beta}}\!\!\cdots\!\int_{-\lambda^{1/\beta}}^{\lambda^{1/\beta}} Q(y_2,\dots,y_d)\,dy_2\cdots dy_d.$

These properties are crucial to guarantee that the bias of the linear
approximation is of the order $O(\lambda)$ (cf. Lemma 3). Note that the simple
rectangular kernel (32) used for the local single-index model can attain such a
bias, but only at the price of too large an $L_2$-norm (which characterizes the
variance). We now give an example showing how a weight with the required
properties can be constructed in a particular case.
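The two-step weight. Set

(33)    $u_1 = \lambda^{1/\gamma}, \qquad u_2 = \lambda^{1/\beta},$

        $\Pi_{1,1} = [0,u_1] \times [v_2,v_1]^{d-1}, \qquad \mu_{1,1} = u_1(v_1 - v_2)^{d-1};$
        $\Pi_{2,1} = [u_1,u_2] \times [v_2,v_1]^{d-1}, \qquad \mu_{2,1} = (u_2 - u_1)(v_1 - v_2)^{d-1};$
        $\Pi_{2,2} = [u_1,u_2] \times [0,v_2]^{d-1}, \qquad \mu_{2,2} = (u_2 - u_1)\,v_2^{d-1}.$

Next, we define, for $y \in \mathbb R^d_+$,

(34)    $\Lambda(y) = \mu_{1,1}^{-1}\mathbb 1_{\Pi_{1,1}}(y) - \mu_{2,1}^{-1}\mathbb 1_{\Pi_{2,1}}(y) + \mu_{2,2}^{-1}\mathbb 1_{\Pi_{2,2}}(y).$

For $y = (y_1,\dots,y_d) \in \mathbb R^d$ we write
$|y| = (|y_1|,\dots,|y_d|)$ and define the weight $K_{(A,\lambda)}$ for
$y \in \mathbb R^d$ by the relation

(35)    $K_{(A,\lambda)}(y) = 2^{-d}\Lambda(|y|).$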



We will call this weight the two-step weight (cf. Figure 4). Its key property
is as follows. First, for any integrable function $q(y_1)$ of the first
coordinate $y_1$ we have

        $\int K_{(A,\lambda)}(y)\,q(y_1)\,dy = \frac{1}{2u_1}\int_{-u_1}^{u_1} q(y_1)\,dy_1,$

since the normalized integral of $q$ over $\Pi_{2,1}$ is exactly the same as
that over $\Pi_{2,2}$. Further, for any integrable function
$Q(y_2,\dots,y_d)$ of $y_2,\dots,y_d$,

        $\int K_{(A,\lambda)}(y)\,Q(y_2,\dots,y_d)\,dy = (2v_2)^{-(d-1)}\int_{-v_2}^{v_2}\!\!\cdots\!\int_{-v_2}^{v_2} Q(y_2,\dots,y_d)\,dy_2\cdots dy_d,$

since the normalized integral of $Q$ over $\Pi_{2,1}$ is exactly the same as
that over $\Pi_{1,1}$. In other words, the negative term
$-\mu_{2,1}^{-1}\mathbb 1_{\Pi_{2,1}}(y)$ in (34) allows us to compensate the
excess of the bias introduced by the two other terms, so that the resulting
bias remains of the order $O(\lambda)$ (cf. Lemma 3).

Fig. 4. Pavement $\Pi_{i,j}$ for the two-step weight, $d = 2$. The weight
vanishes in the white zones.

For the two-step weight (35) we have

        $\int K_{(A,\lambda)}(y)\,dy = 1, \qquad \|K_{(A,\lambda)}\|_1 = 3, \qquad \|K_{(A,\lambda)}\|_2^2 = 2^{-d}\bigl(\mu_{1,1}^{-1} + \mu_{2,2}^{-1} + \mu_{2,1}^{-1}\bigr).$
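
The compensation property of the two-step weight can be verified directly.
The following Python sketch represents the weight by its three cells on the
positive orthant and checks the total mass, the $L_1$-norm and the averaging
identity for the test function $q(y_1) = \cos y_1$. The values chosen for
$v_1$ and $v_2$ are illustrative assumptions (the identities hold for any
$v_1 > v_2 > 0$), and the helper names are hypothetical.

```python
import math

def two_step_cells(u1, u2, v1, v2):
    # Cells on the positive orthant: (u_lo, u_hi, v_lo, v_hi, sign)
    return [
        (0.0, u1, v2, v1, +1),   # Pi_{1,1}
        (u1, u2, v2, v1, -1),    # Pi_{2,1}, the compensating negative step
        (u1, u2, 0.0, v2, +1),   # Pi_{2,2}
    ]

def mass_and_norms(cells, d):
    # Each cell carries height sign/mu, so its signed mass is just its sign.
    total = sum(s for *_, s in cells)
    l1 = sum(abs(s) for *_, s in cells)
    l2_sq = 2.0 ** -d * sum(
        1.0 / ((uh - ul) * (vh - vl) ** (d - 1)) for ul, uh, vl, vh, _ in cells)
    return total, l1, l2_sq

def integral_against_cos_y1(cells):
    # Integral of K(y)*cos(y_1) over R^d; by symmetry it reduces to a sum of
    # cell averages of cos over [u_lo, u_hi], weighted by the cell signs.
    return sum(s * (math.sin(uh) - math.sin(ul)) / (uh - ul)
               for ul, uh, _, _, s in cells)

lam, gamma, beta, d = 0.01, 1.5, 1.8, 2
u1, u2 = lam ** (1 / gamma), lam ** (1 / beta)
v1, v2 = lam ** (1 / (gamma * beta)), 0.5 * lam ** (1 / beta)  # illustrative
cells = two_step_cells(u1, u2, v1, v2)
total, l1, l2_sq = mass_and_norms(cells, d)
lhs = integral_against_cos_y1(cells)
rhs = math.sin(u1) / u1   # (2*u1)^(-1) * integral of cos over [-u1, u1]
```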



We now define

        $\rho = \frac{(d-1)(\gamma-1)}{\beta}$

and consider the subset $\{A = (\gamma,\beta) : \rho \ge (\beta-\gamma)/\gamma\}$
of $\mathcal P_3$. It is easy to see that for $\rho \ge (\beta-\gamma)/\gamma$
we have

        $\|K_{(A,\lambda)}\|_2 \le c\,\lambda^{-d/(2\beta)}.$

Since $\gamma < \beta$ for $A \in \mathcal P_3$, this result is better than
part (ii) of Lemma 2 where $K_{(A,\lambda)}$ is a rectangular kernel. But we
need the condition $\rho \ge (\beta-\gamma)/\gamma$. It is clearly satisfied
when $\rho \ge 1$ (recall that $\gamma > 1$, $\beta \le 2$). For smaller values
of $\rho$ we need to add extra "steps" in the construction, that is, to
introduce piecewise constant weights with more and more pieces of the
pavement, in order to get the bias compensation property as discussed above.
For instance, if $\rho + \rho^2 \ge (\beta-\gamma)/\gamma$ [since
$(\beta-\gamma)/\gamma < 1$, this is certainly the case when
$\rho \ge (\sqrt5-1)/2$] we need a pavement of five sets $\Pi_{i,j}$ in order to
obtain a piecewise constant weight with the required statistical properties,
and so on. We come to the following construction of the weight.

Generic construction. Define a piecewise constant weight $K_{(A,\lambda)}$ as
follows. Fix an integer $r$ that we will further call the number of steps (of
the weight construction). Let $(u_j)_{j=1,\dots,r}$ and
$(v_j)_{j=1,\dots,r+1}$ be, respectively, a monotone increasing and a monotone
decreasing sequence of numbers with $u_1 = \lambda^{1/\gamma}$,
$v_r = \tfrac12\lambda^{1/\beta}$ and $v_{r+1} = 0$. We set

        $\Pi_{1,1} = [0,u_1] \times [v_2,v_1]^{d-1}, \qquad \mu_{1,1} = u_1(v_1 - v_2)^{d-1}.$

For $i = 2,\dots,r$ and $j = i-1,\,i$ we define

        $\Pi_{i,j} = [u_{i-1},u_i] \times [v_{j+1},v_j]^{d-1}, \qquad \mu_{i,j} = (u_i - u_{i-1})(v_j - v_{j+1})^{d-1}.$

For $y \in \mathbb R^d_+$ consider

        $\Lambda_1(y) = \mu_{1,1}^{-1}\,\mathbb 1_{\Pi_{1,1}}(y);$

        $\Lambda_i(y) = \mu_{i,i}^{-1}\,\mathbb 1_{\Pi_{i,i}}(y) - \mu_{i,i-1}^{-1}\,\mathbb 1_{\Pi_{i,i-1}}(y), \qquad i = 2,\dots,r.$

The weight $K_{(A,\lambda)}$ is defined for
$y = (y_1,\dots,y_d) \in \mathbb R^d$ as follows:

(36)    $K_{(A,\lambda)}(y) = 2^{-d}\sum_{i=1}^{r}\Lambda_i(|y|),$

where $|y| = (|y_1|,\dots,|y_d|)$. Clearly,

        $\int K_{(A,\lambda)}(y)\,dy = 1, \qquad \|K_{(A,\lambda)}\|_1 = 2r - 1.$
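
The generic construction can be sketched for an arbitrary number of steps.
The Python code below builds the pavement from given sequences and checks the
unit mass, the $L_1$-norm $2r-1$, and the telescoping identity for a test
function of a transversal coordinate, $Q = \cos y_2$. The numerical sequences
are arbitrary illustrative choices (any increasing $u$ and decreasing positive
$v$ work for these identities), and the function names are hypothetical.

```python
import math

def pavement(u, v):
    # u = [u_1..u_r] increasing, v = [v_1..v_r] decreasing; v_{r+1} = 0 is appended.
    r = len(u)
    v = list(v) + [0.0]
    cells = [(0.0, u[0], v[1], v[0], +1)]                    # Pi_{1,1}
    for k in range(1, r):                                    # k + 1 = i in the text
        cells.append((u[k - 1], u[k], v[k + 1], v[k], +1))   # Pi_{i,i}
        cells.append((u[k - 1], u[k], v[k], v[k - 1], -1))   # Pi_{i,i-1}
    return cells

def integral_against_cos_y2(cells):
    # Integral of K(y)*cos(y_2): signed sum of cell averages of cos over [v_lo, v_hi]
    return sum(s * (math.sin(vh) - math.sin(vl)) / (vh - vl)
               for _, _, vl, vh, s in cells)

u = [0.05, 0.08, 0.12]        # illustrative increasing sequence, r = 3
v = [0.30, 0.20, 0.06]        # illustrative decreasing sequence
cells = pavement(u, v)
r = len(u)
total = sum(s for *_, s in cells)          # integral of K
l1 = sum(abs(s) for *_, s in cells)        # L1-norm of K
lhs = integral_against_cos_y2(cells)
rhs = math.sin(v[-1]) / v[-1]              # average of cos over [-v_r, v_r]
```

The negative cells cancel the contributions of all intermediate levels, so only
the innermost box $[-v_r,v_r]^{d-1}$ survives in the average.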

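Construction of the weight for
$A \in \mathcal P_3 = \{A : 1 < \gamma \le \beta \le 2,\ \beta < d(\gamma-1)+1\}$.
If $\rho \ge (\beta-\gamma)/\gamma$ we define $K_{(A,\lambda)}$ as a two-step
weight, that is, we set $r = 2$ and take $(u_j)$ and $(v_j)$ as in (33).

If $\rho < (\beta-\gamma)/\gamma$ we use another definition. We introduce the
sequence $(a_k)_{k\ge0}$ as follows:

(37)    $a_0 = \beta^{-1}, \qquad a_{k+1} = a_k\rho + \beta^{-1} = \beta^{-1}\sum_{j=0}^{k+1}\rho^j, \qquad k = 0,1,2,\dots.$

The sequence $(a_k)$ is monotone increasing and, since
$\beta < d(\gamma-1)+1$, we have

(38)    $\lim_{k\to\infty} a_k = \infty \ \text{ if } \rho \ge 1, \qquad \lim_{k\to\infty} a_k = (\beta - (\gamma-1)(d-1))^{-1} > \frac1\gamma \ \text{ if } \rho < 1.$

Thus we can define an integer $r \ge 2$ such that

(39)    $a_{r-1} > \frac1\gamma \ge a_{r-2}.$

Note that $r$ depends only on $A = (\gamma,\beta)$ and $d$. Now we set

(40)    $u_1 = \lambda^{1/\gamma}, \qquad u_i = \lambda^{a_{r-i}}, \quad i = 2,\dots,r;$
        $v_i = \lambda^{1/\beta}\,u_{i+1}^{-(\gamma-1)/\beta}, \quad i = 1,\dots,r-1.$

Recall that $v_r = \tfrac12\lambda^{1/\beta}$ and $v_{r+1} = 0$. If
$\rho < (\beta-\gamma)/\gamma$ define the weight $K_{(A,\lambda)}$ by (36),
with the sequences $(u_j)$ and $(v_j)$ as in (40).

Note that for $\rho \ge (\beta-\gamma)/\gamma$ the weight $K_{(A,\lambda)}$ is
just the two-step weight. The corresponding pavement $\{\Pi_{i,j}\}$ only
contains three sets (cf. Figure 4).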
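
The number of steps determined by the sequence $(a_k)$ can be computed with a
short loop. This Python sketch assumes the recursion
$a_0 = 1/\beta$, $a_{k+1} = \rho a_k + 1/\beta$ with
$\rho = (d-1)(\gamma-1)/\beta$, and the stopping rule
$a_{r-1} > 1/\gamma \ge a_{r-2}$; the function name and the iteration cap are
illustrative. With this convention the boundary case
$\rho = (\beta-\gamma)/\gamma$ is assigned $r = 3$.

```python
def number_of_steps(gamma, beta, d, max_steps=10_000):
    rho = (d - 1) * (gamma - 1) / beta
    a = [1.0 / beta]                      # a_0
    r = 2
    while r <= max_steps:
        a.append(a[-1] * rho + 1.0 / beta)   # a_{r-1}
        if a[-1] > 1.0 / gamma >= a[-2]:
            return r, rho, a
        r += 1
    # Happens when the limit of a_k does not exceed 1/gamma (outside zone P3)
    raise ValueError("no admissible r found; is (gamma, beta) in zone P3?")

r_two, rho_two, _ = number_of_steps(gamma=1.5, beta=1.6, d=3)
r_three, rho_three, a_three = number_of_steps(gamma=1.2, beta=2.0, d=6)
```

In the first example $\rho$ exceeds $(\beta-\gamma)/\gamma$ and two steps
suffice; in the second, $\rho < (\beta-\gamma)/\gamma \le (1+\rho)\rho$, which
gives three steps.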

Construction of the weight for $A \in \mathcal P_2$. We consider now another
choice of the sequences $(u_j)$ and $(v_j)$, which provides the weight
$K_{(A,\lambda)}$ with properties similar to those of Lemma 3 but satisfied for
all $A \in \mathcal P_2 \cup \mathcal P_3$ and, what is more, uniformly over
this set. The price to pay for the uniformity is an extra
$\log\log(1/\lambda)$ factor in the bound for the $L_1$-norm of
$K_{(A,\lambda)}$.

If $(\beta-\gamma)/\gamma \le (1+\rho)\rho$ we define the weight as in Lemma 3.
If $(\beta-\gamma)/\gamma > (1+\rho)\rho$ we use another definition of the
sequences $(u_i)$ and $(v_i)$. For any $0 < \lambda < 1$ we define

(41)    $V(\lambda) = \ln\Bigl(\frac{(\gamma-1)(\beta-\gamma)}{\gamma\beta^2}\,\ln(1/\lambda)\Bigr).$

If V^(A) < we define K(_4 as a two-step weight, that is, we set r = 2 and 
take (uj) and (vj) as in (33). If V{X) > we define r = r(A) > 1 by 

r = mm<^ s G N:s > 1, — — < - In 

L s — 1 2 V 2 

Next, set a = -71^, = ( v^+^ )i/^ and define the sequences (uj) and (fj) as 
follows 

Ui = X^^"' exp| — exp(a(i — 1))|, i = 1, . . . ,r, 

(42) ^" 

^.^ ^1/(7/3) exp{-z/exp(m)}, i = l,...,r-l, Vr = ^X^/^. 

Note that = A^/^. 

Some remarks are in order here. 

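1. The number of steps $r$ in the construction of the weight is typically
   small. In particular, $r = 2$ if $\rho \ge (\beta-\gamma)/\gamma$, and
   $r = 3$ if $(1+\rho)\rho \ge (\beta-\gamma)/\gamma > \rho$ [cf. (39)].
   Moreover, for $1 < \gamma \le \beta \le 2$ we have

        $\frac{(\gamma-1)(\beta-1)}{\gamma\beta^2} \le \frac{(\beta-1)^2}{\beta^3} \le \frac18.$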

Hence, V{X) < ln(^^) for ah A > 3 • 10"^, which means that for (1 + 
p)p ^ more than 3 steps of the construction are needed if A > 

3 • 10~^. In other words, unless we are not "extremely far" in the asymp- 
totics, the number of steps r does not exceed 3 and thus the Li-norm of 
the resulting weight ^(a,x) is bounded by 5. 
2. In the asymptotics when $\lambda \to 0$ the number of steps
   $r = r(\lambda)$ in the construction, and thus the $L_1$-norm of the weight
   $K_{(A,\lambda)}$, is at most $O(\ln\ln\lambda^{-1})$. As discussed in the
   previous remark, this behavior starts "extremely far" in the asymptotics,
   so it is essentially of theoretical interest. In the theory, it results in
   an extra $\ln\ln\varepsilon^{-1}$ factor in the upper bound for the
   estimation procedure, as compared to the lower bound in (5). It can be
   shown that for $A \in \mathcal P_2$ a weight with the required approximation
   properties cannot have the $L_1$-norm growing slower than
   $\ln\ln\lambda^{-1}$, as $\lambda \to 0$. On the other hand, as we have seen
   in Lemma 3, for $A \in \mathcal P_3$ solely, there is a choice of sequences
   $(u_j)$ and $(v_j)$ such that the $L_1$-norm of the weight is bounded by a
   constant independent of $\lambda$. This constant, however, depends on
   $A = (\gamma,\beta)$ and explodes as $A$ approaches the boundary of
   $\mathcal P_3$.



7. Proofs. 



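7.1. Proof of Theorem 1. For any $\beta > 0$, $\gamma > 0$ and any
$0 < \varepsilon < 1$ define the integers

        $q_1 = \bigl\lceil (\varepsilon\sqrt{\ln(1/\varepsilon)})^{-2/(2\gamma\beta+\beta+(d-1))} \bigr\rceil.$

Consider the regular grid $\Gamma_{q_1}$ on $[0,1]^{d-1}$ defined by

        $\Gamma_{q_1} = \Bigl\{\Bigl(\frac{2k_1+1}{2q_1},\dots,\frac{2k_{d-1}+1}{2q_1}\Bigr) : k_i \in \{0,\dots,q_1-1\},\ i = 1,\dots,d-1\Bigr\}.$

Denote by $x_1,\dots,x_m$, where $m = \mathrm{card}(\Gamma_{q_1}) = q_1^{d-1}$,
the elements of $\Gamma_{q_1}$ numbered in an arbitrary order.

Let $f_0 : \mathbb R \to \mathbb R_+$ be an infinitely differentiable function
such that $f_0(0) = 1$, $f_0(u) = f_0(-u)$ for all $u \in \mathbb R$,
$f_0(u) = 0$ for $u \notin [-1/2,1/2]$, and $f_0$ is strictly monotone
decreasing on $[0,1/2]$. Examples of such functions can be readily
constructed; compare [27], page 78. Set

        $\psi_0(t_2,\dots,t_d) = \frac12\prod_{i=2}^{d} f_0(t_i) \quad \forall (t_2,\dots,t_d) \in \mathbb R^{d-1}$

and

        $f(u) = L_0 h^\gamma f_0(u/h), \qquad u \in \mathbb R,$

where $h = h_1^\beta$, $h_1 = 1/q_1$ and $0 < L_0 < 1$ is a constant to be
chosen small enough. Consider the following collection of infinitely
differentiable functions of $t = (t_1,\dots,t_d) \in \mathbb R^d$:

        $g_k(t) = f(G_k(t)) = L_0 h^\gamma f_0\bigl(G_k(t)/h\bigr), \qquad k = 0,1,\dots,m,$

where

        $G_0(t) = L_0\sin t_1,$

        $G_k(t) = L_0\sin t_1 + L_0 h_1^\beta\,\psi_0\Bigl(\frac{t_2 - x_{k,2}}{h_1},\dots,\frac{t_d - x_{k,d}}{h_1}\Bigr), \qquad k = 1,\dots,m,$

and $x_{k,j}$ stands for the $j$th component of $x_k$. We note that, in view of
the above definitions, the sets where the functions $g_l$ and $g_k$ differ from
$g_0$ are disjoint for $l \ne k$, $k \ne 0$, $l \ne 0$.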
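
For concreteness, one function with the properties required of $f_0$ is the
rescaled standard bump $\exp\{1 - 1/(1-4u^2)\}$ on $(-1/2,1/2)$. This is only
one possible choice, offered as an illustration; the proof does not depend on
it. The Python sketch below evaluates it and checks the listed properties on a
grid.

```python
import math

def f0(u):
    # Smooth bump: exp(1 - 1/(1 - 4u^2)) on (-1/2, 1/2), zero elsewhere.
    # f0(0) = 1, f0 is even, and it decreases strictly on [0, 1/2].
    if abs(u) >= 0.5:
        return 0.0
    return math.exp(1.0 - 1.0 / (1.0 - 4.0 * u * u))

grid = [k / 100.0 for k in range(0, 46)]      # points in [0, 0.45]
values = [f0(u) for u in grid]
```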
It is easy to see that if $L_0$ is small enough,
$g_k \in \mathbb H(A,\mathcal L)$, $k = 0,\dots,m$. In what follows, we assume
that $L_0$ is chosen in this way. To prove Theorem 1, we follow the scheme of
lower bounds based on reduction to the problem of testing $m+1$ hypotheses
(cf., e.g., [27]). We choose the hypotheses to be determined by
$g_0,\dots,g_m$ and we apply Theorem 2.5 of [27], where we consider the
sup-norm distance
$d(g_l,g_k) = \|g_l - g_k\|_\infty = \sup_{t\in[-1,1]^d}|g_l(t) - g_k(t)|$,
$l,k = 0,1,\dots,m$. Since the functions $g_l$ and $g_k$ differ from $g_0$ on
disjoint sets, for any $l \ne k$, $l,k = 1,\dots,m$, we have

        $d(g_l,g_k) = d(g_0,g_k) \ge L_0 h^\gamma\,|f_0(0) - f_0(L_0 h_1^\beta\psi_0(0)/h)|$

        $\qquad\qquad = L_0 h^\gamma\,|f_0(0) - f_0(L_0(1 + o_\varepsilon(1))/2)|,$

where $o_\varepsilon(1) \to 0$ as $\varepsilon \to 0$. Since $L_0 > 0$ and
$f_0$ is strictly decreasing on $[0,1/2]$ there exists a constant $L_* > 0$
such that, for $\varepsilon$ small enough,

(43)    $d(g_l,g_k) \ge L_* h^\gamma \asymp (\varepsilon\sqrt{\ln(1/\varepsilon)})^{2\gamma/(2\gamma+1+(d-1)/\beta)}, \qquad l \ne k,\ l,k = 0,\dots,m.$

Thus, assumption (i) of Theorem 2.5 in [27] is satisfied with
$s = L_* h^\gamma/2$. It remains to check assumption (ii) of that theorem. The
probability measures $P_{g_k}$ are Gaussian, and the Kullback-Leibler
divergence between $P_{g_k}$ and $P_{g_0}$ has the form

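        $K(P_{g_k},P_{g_0}) = \frac{1}{2\varepsilon^2}\int_{\mathcal D}(g_0(t) - g_k(t))^2\,dt$

        $\qquad = \frac{L_0^2h^{2\gamma}}{2\varepsilon^2}\int_{\mathcal D}\Bigl[f_0\Bigl(\frac{L_0\sin t_1}{h}\Bigr) - f_0\Bigl(\frac{L_0\sin t_1}{h} + w(t_2,\dots,t_d)\Bigr)\Bigr]^2 dt,$

where we write for brevity

        $w(t_2,\dots,t_d) = L_0\,\psi_0\Bigl(\frac{t_2 - x_{k,2}}{h_1},\dots,\frac{t_d - x_{k,d}}{h_1}\Bigr).$

Since, for any $a, w \in \mathbb R$,

        $[f_0(a) - f_0(a+w)]^2 \le w^2\int_0^1 [f_0'(a + uw)]^2\,du,$

we find

        $K(P_{g_k},P_{g_0}) \le \varepsilon^{-2}L_0^2h^{2\gamma}\int w^2(t_2,\dots,t_d)\,\Bigl\{\int_0^1\!\!\int_{|t_1|\le|\mathcal D|}\Bigl[f_0'\Bigl(\frac{L_0\sin t_1}{h} + u\,w(t_2,\dots,t_d)\Bigr)\Bigr]^2 dt_1\,du\Bigr\}\,dt_2\cdots dt_d,$

where $|\mathcal D|$ is the Euclidean diameter of $\mathcal D$. Since $f_0$ is
supported on $[-1/2,1/2]$ and $|w(t_2,\dots,t_d)| \le 1/2$, the values
$f_0'((L_0\sin t_1)/h + u\,w(t_2,\dots,t_d))$ under the last integral can be
nonzero only if $L_0|\sin t_1| \le h$. The Lebesgue measure of the set
$\{t_1 : |t_1| \le |\mathcal D|,\ L_0|\sin t_1| \le h\}$ is $O(h)$, as
$h \to 0$. Hence, the double integral in the last display is bounded by
$c_9h$ for all $h$ small enough, where $c_9 > 0$ is an absolute constant. This
yields

        $K(P_{g_k},P_{g_0}) \le c_{10}L_0\ln(1/\varepsilon),$

where $c_{10} > 0$ is an absolute constant. Next, $m = q_1^{d-1}$, so that
$\ln m \asymp \ln(1/\varepsilon)$. This and the previous inequality imply that
if $L_0$ is chosen small enough, we have

(44)    $K(P_{g_k},P_{g_0}) \le (1/16)\ln m.$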

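Using (43), (44) and applying Theorem 2.5 in [27] we get the lower bound

(45)    $\liminf_{\varepsilon\to0}\ \inf_{\tilde g_\varepsilon}\ \sup_{g\in\mathbb H(A,\mathcal L)} E_g\bigl[\bigl((\varepsilon\sqrt{\ln(1/\varepsilon)})^{-2\gamma/(2\gamma+1+(d-1)/\beta)}\,\|\tilde g_\varepsilon - g\|_\infty\bigr)^p\bigr] > 0,$

which is valid for all $\beta > 0$, $\gamma > 0$ and all $p > 0$.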

We now show that for the trivial cases discussed in Section 2 we can obtain
better lower bounds. Consider first the case where $0 < \beta, \gamma \le 1$.
Then we use the same technique as above, but we set now
$q_1 = \lceil(\varepsilon\sqrt{\ln(1/\varepsilon)})^{-2/(2\gamma\beta+d)}\rceil$.
We then introduce a regular grid $\Gamma^*_{q_1}$ on $[0,1]^d$ defined by

        $\Gamma^*_{q_1} = \Bigl\{\Bigl(\frac{2k_1+1}{2q_1},\dots,\frac{2k_d+1}{2q_1}\Bigr) : k_i \in \{0,\dots,q_1-1\},\ i = 1,\dots,d\Bigr\}$

and denote by $x_1,\dots,x_m$, where
$m = \mathrm{card}(\Gamma^*_{q_1}) = q_1^d$, the elements of $\Gamma^*_{q_1}$
numbered in an arbitrary order. We set now

        $\psi_0(t) = \prod_{i=1}^{d} f_0(t_i), \qquad t \in \mathbb R^d,$

and we choose the functions $g_k$ in the following way:

        $g_0(t) = 0,$

        $g_k(t) = L_0h^{\gamma\beta}\,\psi_0^\gamma\Bigl(\frac{t - x_k}{h}\Bigr), \qquad t \in \mathbb R^d,\ k = 1,\dots,m,$

where $h = 1/q_1$. Note that for sufficiently small $h$ we can write these
functions as compositions $g_k = f \circ G_k$, where
$f(u) = L_0'|u|^\gamma f_0(u)$, $G_0(t) = 0$,
$G_k(t) = L_0h^\beta\psi_0((t - x_k)/h)$ and $L_0' = L_0^{1-\gamma}$, with a
slightly different definition of $f_0$ than above. Namely, we choose $f_0$ to
be infinitely differentiable, supported on $[-1/2,1/2]$ and such that
$f_0(u) = 1$ for $u \in [-1/4,1/4]$. It is easy to see that if $L_0$ is small
enough, $g_k \in \mathbb H(A,\mathcal L)$, $k = 0,\dots,m$. With this choice of
$g_k$ we get

(46)    $d(g_l,g_k) \ge L_0h^{\gamma\beta}\psi_0^\gamma(0) \asymp (\varepsilon\sqrt{\ln(1/\varepsilon)})^{2\gamma\beta/(2\gamma\beta+d)}, \qquad l \ne k,\ l,k = 0,\dots,m.$

Next,

        $K(P_{g_k},P_{g_0}) = \frac{1}{2\varepsilon^2}\int_{\mathcal D}(g_0(t) - g_k(t))^2\,dt$

(47)    $\qquad \le L_0^2\,\varepsilon^{-2}h^{2\gamma\beta+d}\int\psi_0^{2\gamma}(v)\,dv$

        $\qquad = O(\ln(1/\varepsilon)) \quad\text{as } \varepsilon \to 0.$

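Using (46), (47) and Theorem 2.5 in [27], the proof is completed as in the
previous case, so that we get the lower bound

(48)    $\liminf_{\varepsilon\to0}\ \inf_{\tilde g_\varepsilon}\ \sup_{g\in\mathbb H(A,\mathcal L)} E_g\bigl[\bigl((\varepsilon\sqrt{\ln(1/\varepsilon)})^{-2\gamma\beta/(2\gamma\beta+d)}\,\|\tilde g_\varepsilon - g\|_\infty\bigr)^p\bigr] > 0,$

which is valid for all $0 < \beta, \gamma \le 1$ and all $p > 0$.

Finally, the second trivial case where (45) can be improved corresponds
to $\gamma \ge \beta \vee 1$. As observed in Section 2, in this case we have
the inclusion $\mathbb H_d(\beta,L_4) \subset \mathbb H(A,\mathcal L)$ with some
constant $L_4 > 0$, and we can use the standard lower bound for
$\mathbb H_d(\beta,L_4)$ (cf. [2, 3, 6, 23]):

(49)    $\liminf_{\varepsilon\to0}\ \inf_{\tilde g_\varepsilon}\ \sup_{g\in\mathbb H(A,\mathcal L)} E_g\bigl[\bigl((\varepsilon\sqrt{\ln(1/\varepsilon)})^{-2\beta/(2\beta+d)}\,\|\tilde g_\varepsilon - g\|_\infty\bigr)^p\bigr] > 0.$

Combining the bounds (45), (48) and (49) we obtain the result of Theorem 1.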

7.2. Proof of Theorem 2. We need the following technical result. 

Lemma 5. Let $\zeta = (\zeta_1,\dots,\zeta_M)$ be a Gaussian random vector
defined on a probability space $(\Omega,\mathfrak F,P)$ and such that
$E\zeta_m = 0$, $E\zeta_m^2 = \sigma_m^2$, $m = 1,\dots,M$. Let $\hat m$ be a
random variable with values in $\{1,\dots,M\}$ defined on the same probability
space. Then for all $A > 1$ and all $s > 0$ we have



* m=l,...,M J 

where ci2{A,s) > is a constant depending only on A and s. 

Proof is standard (see, e.g., [14]). 

To prove Theorem 2 we proceed in steps. 
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1° . Reduction to the discrete norm. Fix A = {-f,(3) £ (0,2]2, and suppose 
that g S M{A,C). Let, for brevity, g* = g'^^. In view of the construction of 
the global estimator [cf. (19)] we get, for all g £ ]H[{A,C), 

ll^e -^lloo < sup max \g*{x)-g{x)\ 
2gzda:Gn,{^)n[-i,i]d 

(50) 

<\-g:-g\^ + Cs'^(P-'\ 



where 



gl - g\^ ^ max j^^z) - g{z)\ with Z, = {e'zf n [-1, l]'^. 



Here and in what follows we will use the same notation C for possibly 
different positive constants depending only on £ and d. Since e^'^^P'^^) = 
o(</)e(7, /?)),£ 0, for all (7, /3) G M^, it is sufficient to prove Theorem 2 with 
the loss given by the maximum norm $|\cdot|_\infty$ on the finite set $Z_\varepsilon$. Thus, without loss of generality, in what follows we will replace $\|\cdot\|_\infty$ by $|\cdot|_\infty$.

2°. Control of large deviations. To any z € Zf, we assign a vector 9^ €
such that \\6^ - 9q\\ < where is defined in (20). Next, we set Jq = 
(^, 0^, Ae(^)). Introduce the random event 

where is the set of acceptable triplets J defined in Section 5. We now 
show that for all e > small enough 

(51) sup ^g[T)<cx2e^^, 

where the constant C12 depends only on d. Indeed, in view of the definition 
of the random set T^, 

.FC U U {|A^,5^,,^.(z)|>TH,(j',Jo^)} 
and therefore 

(52) Pg(.^)<E E F,{|A^,5j'*J^(^)l>TH,(j',Jo')}. 

ZdZ, J'eagrid 

Note that 

^g'^J'dJ'^J^.iz) = [^J'Kj'^j^*g]{z). 

Applying Proposition 2 with = {A,0^ ,\e{A)) and A = Aq = Ki^A) we 
obtain, 

sup \¥.gAj,gj>^j4z)\ 
gm(A,C) 

(53) 

< cn{A,(^)(||ir^nii + Ili^^oHli) + ll^^^'llill^J^ lli^'}- 
Now, due to the construction of the weight \^{a.X) ^^'^ the fact that H-fCj^Hi =
ll^^(^,Ae(.4)) 111 for all J' G 5gridi there exists a constant C13 depending only on 
A and d such that = m.axj^2gnd ll-^J'lli satisfies 

K\<ci3lnln{l/e), ifAeVi- 

Since also $\|K_{J'}\|_1\ge1$ and $\lambda_\varepsilon(\mathcal A)/(\varepsilon\ln\ln(1/\varepsilon))\to\infty$ as $\varepsilon\to0$, we have, for $\varepsilon>0$ small enough,

sup \EgAj'gj'^j^{z)\ 
gmiA,c) 

(54) <2cuXe{A)mj'\\i + \\Kj^A\i) 

= 2£^ln(l/6)||K(^,;,^(^))||2(||i^J'||l + l|i^JoHll)' 

where we used that Ae(^) is a solution of (17). Note also that in Fg- 
probability 

(55) Aj,gj>^j^^{z) - EgAj>gj>^j^^{z) -^M{0,£'^\\Aj>Kj>^j^^\\l). 

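Using (22), (52)-(55) and the definition of the threshold $\mathrm{TH}_\varepsilon(\cdot,\cdot)$ we obtain that, for $\varepsilon>0$ small enough,

$P_g(\mathcal F)\le\mathrm{card}(Z_\varepsilon)\,\mathrm{card}(S_\varepsilon)\,P\bigl\{|\zeta|>\sqrt{(4p+8d)\ln(1/\varepsilon)}\bigr\}$,

where $\zeta\sim\mathcal N(0,1)$. This proves (51) since $\mathrm{card}(Z_\varepsilon)\le(2\varepsilon^{-2}+1)^d$ and $\mathrm{card}(S_\varepsilon)\le(\sqrt d/\varepsilon)^{d-1}$.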
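
The arithmetic behind this last step can be sketched as follows. This spelling-out is ours and is not part of the original argument: it assumes that the threshold inside the probability is $\sqrt{(4p+8d)\ln(1/\varepsilon)}$, it uses the standard Gaussian tail bound $P\{|\zeta|>t\}\le2e^{-t^2/2}$, and it writes $\mathcal F$ for the random event and $S_\varepsilon$ for the second finite set whose cardinality is bounded above.

```latex
\begin{align*}
P\bigl\{|\zeta|>\sqrt{(4p+8d)\ln(1/\varepsilon)}\bigr\}
  &\le 2\exp\{-(2p+4d)\ln(1/\varepsilon)\}
   = 2\,\varepsilon^{2p+4d},\\
\operatorname{card}(Z_\varepsilon)\,\operatorname{card}(S_\varepsilon)
  &\le (2\varepsilon^{-2}+1)^{d}\,(\sqrt d/\varepsilon)^{d-1}
   \le 3^{d}\,d^{(d-1)/2}\,\varepsilon^{-(3d-1)},
   \qquad 0<\varepsilon<1,\\
P_g(\mathcal F)
  &\le 2\cdot 3^{d}\,d^{(d-1)/2}\,\varepsilon^{2p+d+1}.
\end{align*}
```

Under these assumptions the probability decays polynomially in $\varepsilon$, with a constant depending only on $d$, which is the type of bound stated in (51).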

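3°. Two intermediate bounds on the risks. Using that $|g^*|\le\ln\ln(1/\varepsilon)$ and $g\in\mathbb H(\mathcal A,\mathcal L)$ is uniformly bounded we deduce from (51) that, for all $\mathcal A=(\gamma,\beta)\in(0,2]^2$,

(56) $\limsup_{\varepsilon\to0}\sup_{g\in\mathbb H(\mathcal A,\mathcal L)}E_g\bigl(\varphi_\varepsilon^{-p}(\gamma,\beta)\,|g^*-g|_\infty^p\,I\{\mathcal F\}\bigr)=0.$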

We now control the bias of gj^ via Proposition 1, its stochastic error via 
the bounds on ||K(_4_;^^(_4)) ||2 in Lemmas 2-4 and apply (17) to get that, for 
all^=(7,/?)G(0,2K 

(57) limsup sup Eg{cl)~P{j, P)\gj^ - g\l^) < 00. 

e-^o gem{A,c) 
4°. Final argument. Note that on the event J-'^ the set T2 of acceptable
triplets J is nonempty for every z € Z,., so that exists. Thus, on J^"^ we 
can write, for all z £ Z^, 

(58) \g j^iz) - g{z)\ < \A^J^^^j,{z)\ + \Aj^^g j^,^^^{z)\ + \gj^^iz) - g{z)\. 

Further, on J^'^ the triplet J^q is acceptable for all z £ Z^. This and the 
acceptability (by definition) of J'z imply that on for all z £ Z^, 

(59) 

This, the definition of the threshold THg and the fact that ||-ftrj'||2 = ||'^(y4,Ae:(y4))l|2 
for all J' G 3grid yield that on JT*^, for all z £ Z^, 

\g J (z) - g{z)\ < 4C{p, d)K*J K(^a,xAj^)) h^^HV^) + \9J,^ (^) " 5(^)1 

(60) 

= ^Cip^dy-^lK^UA) + \9Js{z) - g{z)\. 

We combine (57) and (60) to get, with some constants $c_{14}$, $c_{15}$, $c_{16}$ independent of $\varepsilon$,

sup E^d^: - fflSoM-^'}) < Ci,{K*^Xe{A)Y + C15<A?(7, /?) 
gm{A,C) 

(61) 

<ci^{K\<Pe{l.li)7. 
Theorem 2 follows now from (56) and (61). 

APPENDIX: PROOFS OF AUXILIARY RESULTS 
A.1. Proof of Proposition 2.

1°. Preliminary remarks. For any J £2 and any x £ [-1,1]'^ we 
may write 

[AjKj^jg*g\{x) 

= [Kj^j- * g\{x) - [Kj * g\{x) 

= Kj{y-x)Kj^{t-y)dy^g{t)dt-[Kj*g]{x) 
= j Kj{y-x)(^j Kj^\t - y)g{t) dt^ dy - [Kj * g]{x) 
= j Kj{y- x)g{y) dy - [Kj * g]{x) 
(62) + j Kj{y-x)(^j Kj^^it - yMt) - giy)] dt^ dy
= jKj{y-x)(^J Kj^it - y)[git) - g^y)] d?j dy 
= I Kj{v) 

= j K(^,;,)(Mji;) j K(j^^x){Ml.z)[g{z+v + x)-g{v + x))dzdv. 

Define $G_x(\cdot)=G(\cdot+x)$ and $f_x(\cdot)=f(\cdot+G(x))$. Then $g(z+v+x)=f(G_x(z+v))$ and $g(v+x)=f(G_x(v))$. Note that, for all $x\in[-1,1]^d$,

(63) $G_x\in\mathbb H_d(\beta,L_2)$, $f_x\in\mathbb H_1(\gamma,L_1)$.

If $1<\gamma\le2$, the second property in (63) implies

(64) $f'_x\in\mathbb H_1(\gamma-1,2L_1)$.

In the case where $1<\beta\le2$, for all $u\in\mathbb R^d$, $x\in[-1,1]^d$ we define $\tilde G_x(u)=G_x(u)-G_x(0)-[\nabla G_x(0)]^Tu$. In view of (63), for all $x\in[-1,1]^d$ we have

(65) $\|\nabla\tilde G_x(u)\|\le2L_2$ $\forall u\in\mathbb R^d$,

(66) $|\tilde G_x(t)-\tilde G_x(u)-[\nabla\tilde G_x(u)]^T(t-u)|\le L_2\|t-u\|^\beta$ $\forall t,u\in\mathbb R^d$,

$|\tilde G_x(u)|\le L_2\|u\|^\beta$, $u\in\mathbb R^d$.
It follows from the definition of K(_4_;^) and Lemmas 1-4 that 

(67) J \\v\\^^\K(^j^^x){v)\dv<c'QX G (0, 2]^, A > 0, 

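where $c_9>0$ is a constant depending only on $\mathcal L$ and $d$. Furthermore, for any $\mathcal A=(\gamma,\beta)\in(0,2]^2$ and any $\lambda\le1$ the support of $\mathcal K_{(\mathcal A,\lambda)}$ is contained in a ball $\{u\in\mathbb R^d:\|u\|\le c_K\lambda^{1/(\gamma\beta)}\}$ where the constant $c_K>0$ depends only on $d$. Therefore,

(68) $\mathcal K_{(\mathcal A,\lambda)}(M_\vartheta^Tu)=0$ $\forall u,\vartheta\in\mathbb R^d:\ \|u\|>c_K\lambda^{1/(\gamma\beta)},\ \|\vartheta\|=1$. □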

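2°. Proof for the zone of RISD local model: $1<\gamma\le\beta\le2$. Using (63) and the Taylor expansion for $G_x$ we obtain, for all $x\in[-1,1]^d$, $z,v\in\mathbb R^d$,

(69) $g(z+v+x)=f\bigl(G_x(0)+[\nabla G_x(0)]^T(z+v)+\tilde G_x(z+v)\bigr)=f_x\bigl([\nabla G_x(0)]^T(z+v)+\tilde G_x(z+v)\bigr)$,

where $\tilde G_x(u)=G_x(u)-G_x(0)-[\nabla G_x(0)]^Tu$ is the Taylor remainder defined above.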
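
For the reader's convenience, the two equalities in (69) can be unpacked as follows. This is a sketch under the reading that the function added inside $f$ is the first-order Taylor remainder of $G_x$ at the origin; the tilde used for it here is our notation.

```latex
\begin{align*}
G_x(w) &= G_x(0)+[\nabla G_x(0)]^{T}w+\tilde G_x(w), \qquad w=z+v,\\
g(z+v+x) &= f\bigl(G_x(w)\bigr)
          = f\bigl(G(x)+[\nabla G_x(0)]^{T}w+\tilde G_x(w)\bigr)
          && \text{since } G_x(0)=G(x),\\
         &= f_x\bigl([\nabla G_x(0)]^{T}w+\tilde G_x(w)\bigr)
          && \text{since } f_x(\cdot)=f(\cdot+G(x)).
\end{align*}
```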

Note that, by definition, VG^(O) = VG(x) = ??g||VG(x)||. Set VG* = 
i?^||VG(2;)|| and define 

g^{z + v + x) = fxi[VG^f{z + v) + Gx{z + v)). 



J Kjg{z){g{z + v + 



x) — g{v + x)) dz 



dv 




We now approximate g{z + v + x) by g^:{z + v + x) in the last line of (62). 
In view of (68), it suffices to consider there only the values z,v satisfying 
\\z\\, \\v\\ < CK- For such z,v and all x G [—1, 1]"^, the condition W^Oq — <£ 
and (63) imply 

(70) \g{z + v + x)-g^{z + v + x)\ < 2cKLi\\VG{x)\\e < 2cKLiL2e. 

Using (63)-(66), the Taylor expansion for fx and (64), we get that for all 
X E [—1, 1]"^, z,v the following representation holds: 

g^iz + V + x) = fxiiVG^f {z + v)) 

+ fU[VG,f{z + v))Gx{z + v) + Bx,i{z,v)\\z + v\\''^ 

= U[VG,fiz + v)) 

+ [K{NG.f{z + v))-fU[VG.fv)] 

(71) 

X iG,iv) + [VG,{v)fz) 
+ fU[VG,fv)iGx{z + v)- Gx{v)) 
+ fU[VG,fv)Gx{v) 

+ Bx,2{z, v)\[VG,fz\^-^z\f + Bx,i{z,v)\\z + v^^ 
where, for all x £ [-1,1^, z,v £ R'^, Bx:,i{-,-) and B 

x,2(';') are functions 

satisfying 

(72) \Bx,i{z,v)\<LiLl \Bx,2{z,v)\<2LiL2. 
Putting z = in (71) we obtain 

(73) g^v + x) = UiVG.fv) + f'x{[VG,fv)Gx{v) + B^M^MM^^ ■ 
From (71) and (73) we get, for ah x £ [-1, 1]*^, z,v£ R'^, 

g*{z + V + x) - g^.{v + x) 

= /x.([VG,]^(z + v)) - MVG.fv) 

+ m-^G.fiz + v)) - fU[yG.fv)]{GAv) + [VGx{v)fz) 

(74) 

+fUNG,fv)iGAz + v)-Gxiv)) 

+ Bx,2{z, v)\[VG,fzr%\f + Bx,i{z, v)\\z + v^" 

-BxM^)Mr^- 
Put u = MJ^v, s = MJ^z. We get from (74) that 

g^ [M^x s + Mi)x u + x) — g^ {M^x u + x) 

= {fx{si +Ui) - fx{ui)) 
(75) +Au,Asi)(G.{u) + [VG^{u)fs)

+ fU\\VG{x)\\ui)(G,{s + u)- G^iu)) + B,,2{s,u)\si\''-'\\sf 

+ B,,i{s,u)\\s + ur^ -B,,i{0,u)\\u\\""', 
where si and ui are the first components of s G M'^ and u S M"^, respectively, 
Uui) = /x.(||VG(x)||ni), G,{u) = G,{M^.u), 
Bx,iis,u) = Bx^i{M^xs,Mi)xu) 
B,,2{s,u) = ||VG(x)p-iS,,2(M^.s,M^.n) 

and 

AuAsi) = fU\\^G{x)\\{si + m)) - fU\\VG{x)\\ui). 

It is easy to see that inequalities (65) and (66) remain valid with Gx in place 
of Gx. 

Now for all x G [— 1, l]'^, s, u G M'^ we introduce 

QuAsi) = {fx{si + ui) - U{ui)) + AuAsi)(Gx{u) + [VG^.(n)]^rsi) 

+ m'^G{x)\\ui)[VGx{u)Y'd-si, 
PuAs) = /^(||VG(x)||^xi)(G,(s + u)-Gx{u) - [VGx{u)fs), 
B^^^{s) = Bx,2{s,u), 

Qu,x{s) = qu,x{si) + pu,xis) + Bx,2{''^,u)\si\^~'^\\s\f , 

PuAs) = fx{\\yGix)\\isi+ui))[VGxiu)fs^, 
where s± = s — si??^. With this notation (75) can be written as 

[M^x s + Mi)x u + x) — {M^x u + x) 

(76) 

= QuA^) + PuA^) + ^xm(s, u)\\s + uir" - BxAO, ^)ll^ir^- 

We now prove that, for all x G [—1,1]°' and all u G M"^ such that ||ii|| < 
CA'A^/^'^^) [cf. (68)], the triplet {qu,x,Pu,x, B''''') belongs to the set 53(^,A) 
(cf. definition before Lemma 3), and thus Lemmas 3 or 4 can be applied. 
We need to check (23)-(25). 

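Checking (23). In view of (63) we have

$|\tilde f_x(s_1+u_1)-\tilde f_x(u_1)-\tilde f'_x(u_1)s_1|\le L_1L_2^\gamma|s_1|^\gamma$,

where $\tilde f_x(u_1)=f_x(\|\nabla G(x)\|u_1)$ as above. Therefore,

(77) $\left|\frac{1}{2\lambda^{1/\gamma}}\int_{-\lambda^{1/\gamma}}^{\lambda^{1/\gamma}}\bigl(\tilde f_x(s_1+u_1)-\tilde f_x(u_1)\bigr)\,ds_1\right|\le\frac{L_1L_2^\gamma}{2\lambda^{1/\gamma}}\int_{-\lambda^{1/\gamma}}^{\lambda^{1/\gamma}}|s_1|^\gamma\,ds_1.$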
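
The right-hand side of (77) can be evaluated in closed form. The following worked step is our own addition: it assumes that (77) is read with integration over $[-\lambda^{1/\gamma},\lambda^{1/\gamma}]$ and with the constant $L_1L_2^\gamma$, and it relies on the fact that the linear term in $s_1$ integrates to zero over a symmetric interval.

```latex
\begin{align*}
\int_{-a}^{a}|s_1|^{\gamma}\,ds_1
  &= \frac{2a^{\gamma+1}}{\gamma+1},
  \qquad a=\lambda^{1/\gamma},\\
\frac{L_1L_2^{\gamma}}{2\lambda^{1/\gamma}}\cdot
\frac{2\lambda^{(\gamma+1)/\gamma}}{\gamma+1}
  &= \frac{L_1L_2^{\gamma}}{\gamma+1}\,\lambda .
\end{align*}
```

Under this reading the bound is linear in $\lambda$, in agreement with the form $c_3\lambda$ required in (23) and with the denominators $\gamma+1$ visible in (78).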
Next, remark that (64) implies |^u,a;('5i)| < 2LiL2^^\si\'^~^ . Furthermore,
(66) with Gx in place of Gx yields |Gx(ti)| < -L2||'w||^- Now, qu,x{0) = and 
using these remarks, (77) and (65) we get, for ||n|| < ckX^^^'^^\ 



(78) 







1/7 



- + ^ \^uAsimGx{u)\ + ||VG.(n)|||.i|)(i.i 



An 



2 V7 7 + 1 / 



L1L2 



+ 2L1L2' 



V 7 



+ 



7 + 1 



A < C3A, 



where the constant $c_3$ depends only on $\mathcal L$ and $d$. It can be taken as the maximum of the last expression in square brackets over $(\gamma,\beta)\in[1,2]^2$.

Checking (24) and (25). It suffices to note that, for all x G [—1,1]'^, the 
first property in (66) with Gx in place of Gx and the second property in (63) 
yield 

- PuA^) - NPuAs)fis' - s)\ < \fU\\^G{x)\\ui)\L2\\s' - sf 

< LiL2\\s' - s\f ys,s'£R'^. 

This proves (24) with 6 = /? and L = L1L2. Finally, (25) with B = S"'^, 
C4 = 2LiL2 follows from (72). 

We are now in a position to apply Lemmas 3 and 4. We demonstrate this, for example, for Lemma 4. Take there $q=q_{u,x}$, $p=p_{u,x}$, $B=B^{u,x}$ for any $\|u\|\le c_K\lambda^{1/(\gamma\beta)}$ and $x\in[-1,1]^d$. Since $Q_{u,x}(0)=0$, the result (29) of Lemma 4 yields

(79) $\left|\int\mathcal K_{(\mathcal A,\lambda)}(s)\,Q_{u,x}(s)\,ds\right|\le c_5\lambda,$

where $c_5$ depends only on $\mathcal L$ and $d$. Furthermore, by construction the weight
^[A,\) is symmetric, that is, ^(A,X)i^) — ^{A,X){~^) hence 



j ^{A,X)is)Pu,xis)ds = 0. 



(80) 

Next, using (72) we find 

|B,,i(s,n)||s + ^z|r^-5,,i(0,n)||nP^|<2^%L^(Pir^ + ||nP^). 
Combining this inequality and (79)-(80) with (76) we get, for all $x\in[-1,1]^d$,

J K(^,A)(s)(5'*(M?^'S + M^xu + x) — g^{M^xu + x))ds 



<C5A + 2T^LiL] 



K(AA)(^)IPir''d^+l|K(AA)llil|n|r'^ 



We finally get (21) from this inequality invoking (67), (62), (70) and recalling 

that ||K(^,;,)||i = ||J^j||i for all^G(0,2]2,A>0, and ||K(^,;,)||i = 

□ 

3°. Proof of (21) for the local single-index zone: $\gamma\le1$, $1<\beta\le2$. Using (66) and the second property in (63), for all $z,v\in\mathbb R^d$, $x\in[-1,1]^d$ we may write

g4z + v + x)= U[VG,f{z + vj) + S,,i(z, v)\\z + vW"'^, 

where $B_{x,1}$ satisfies (72). This can be viewed as a simplified version of (71). Following almost the same argument as in 2° (the main difference is that now we drop all the terms containing $f'_x$ and $B_{x,2}$) and applying Lemma 2 we obtain (21). □

4°. Proof of (21) for the zone of slow rate: $(\gamma,\beta)\in(0,1]^2$. Using the Hölder condition on $f$ and $G_x$ we obtain, for all $z,v\in\mathbb R^d$, $x\in[-1,1]^d$,

$g(z+v+x)=f(G_x(z+v))=f(G_x(0))+B_{x,1}(z,v)\|z+v\|^{\gamma\beta}$,

where $B_{x,1}$ satisfies (72). Now, (21) easily follows from this relation, (62), (67) and the definition of $\mathcal K_{(\mathcal A,\lambda)}$ for the zone of slow rate. □
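
The exponent $\gamma\beta$ in the last display comes from composing the two Hölder conditions. A minimal sketch, assuming that $f$ is Hölder with exponent $\gamma\le1$ and constant $L_1$ and that $G_x$ is Hölder with exponent $\beta\le1$ and constant $L_2$ (so that the implied bound on the remainder coefficient is $L_1L_2^\gamma$):

```latex
\begin{align*}
|g(z+v+x)-f(G_x(0))|
  &= |f(G_x(z+v))-f(G_x(0))|\\
  &\le L_1\,|G_x(z+v)-G_x(0)|^{\gamma}\\
  &\le L_1\bigl(L_2\|z+v\|^{\beta}\bigr)^{\gamma}
   = L_1L_2^{\gamma}\,\|z+v\|^{\gamma\beta}.
\end{align*}
```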

5°. Proof of (21) for the zone of inactive structure: $1<\beta\le\gamma\le2$. Since $f\in\mathbb H_1(\gamma,L_1)$ and $\|\nabla G_x(\cdot)\|\le L_2$, for all $z,v\in\mathbb R^d$, $x\in[-1,1]^d$ we may write

$f(G_x(z+v))=f(G_x(v))+f'(G_x(v))\bigl(G_x(z+v)-G_x(v)\bigr)+B_{x,1}(z,v)\|z\|^\gamma$

$=f(G_x(v))+f'(G_x(v))\bigl(G_x(z+v)-G_x(v)-[\nabla G_x(v)]^Tz\bigr)+f'(G_x(v))[\nabla G_x(v)]^Tz+B_{x,1}(z,v)\|z\|^\gamma$

$=f(G_x(v))+f'(G_x(v))[\nabla G_x(v)]^Tz+B_{x,2}(z,v)\|z\|^\beta+B_{x,1}(z,v)\|z\|^\gamma,$

where $B_{x,1}$ satisfies (72) and $|B_{x,2}(\cdot,\cdot)|\le L_1L_2$. Since the weight $\mathcal K_{(\mathcal A,\lambda)}$ is symmetric,

" K^j^^,,^{M^^z)f{Gx{v))[VGx{v)Vzdz = 0. 
Now, (21) easily follows from these relations, (62), the definition of $\mathcal K_{(\mathcal A,\lambda)}$ for the zone of inactive structure and the condition $\lambda\le1$. □

6°. Proof of (22). For a function $K\in L_2(\mathbb R^d)$, let us denote by $\hat K$ its Fourier transform. Using Parseval's identity we obtain, for any $J,J'\in\mathfrak J$,



Since J Kj' = 1, this proves (22). □ 

A.2. Proof of Lemma 4. First, note that some cases are trivial because
the number r of steps of the weight construction is bounded by 3. In fact, 
if (p + l)p <{(3- 7)/7 and V{\) < ln(^^) we have r < 3 by definition. If 
(p + l)p > {(3 — 7)/7 we use the weight as in Lemma 3. But for this weight 
the condition (p + l)/0 > (/? — 7)/7 implies that, again, r < 3. 

So, we will treat only the remaining case where (p + l)p < (/? — 7) /7 and 
V{X) > ln( ^"^^ ). The last inequality implies that r > 3. 

Note that, by definition, a < ^ ln( ^^^ ). Further, for r > 3 we have also 
the lower bound: ce> \ ln( ^"^^ ). Thus for r > 3, 

(81) 0.786< (^)-"%.-"< (^)-"% 0.887. 
1°. Proof of (29). From the definition of ^{a,x) we find 

T 

[\<{A,x)*Qm = 2-^Y. I K{\y\)q{yi)dy = 2-'' j Ai(|y|)g(yi) 

i=l 

^ i[o,ni](yi)c?yi, 



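where $u_1=\lambda^{1/\gamma}$. This and (23) imply

(82) $\bigl|[\mathcal K_{(\mathcal A,\lambda)}*q](0)-q(0)\bigr|=\left|(2\lambda^{1/\gamma})^{-1}\int_{-\lambda^{1/\gamma}}^{\lambda^{1/\gamma}}q(y_1)\,dy_1-q(0)\right|\le c_3\lambda.$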

We now obtain a similar bound for $|[\mathcal K_{(\mathcal A,\lambda)}*p](0)-p(0)|$. Note that, in view of (24), for all $z=(z_1,\dots,z_d)\in\mathbb R^d$ we have

(83) $p(z)=\bar p(z)+z_1\frac{\partial p}{\partial z_1}(0,z_2,\dots,z_d)+B_1(z)|z_1|^\beta,$

where $\bar p(z)=p(0,z_2,\dots,z_d)$ and $\sup_{z\in\mathbb R^d}|B_1(z)|\le L$. For the same reason, for all $z_{(d-1)}=(0,z_2,\dots,z_d)$ we have

(84) $\bar p(z)=p(0)+[\nabla p(0)]^Tz_{(d-1)}+B_2(z_{(d-1)})\|z_{(d-1)}\|^\beta,$

where as previously $|B_2(\cdot)|\le L$. Combining (83) and (84) and taking into account that the function $\mathcal K_{(\mathcal A,\lambda)}$ is symmetric, $\int\mathcal K_{(\mathcal A,\lambda)}=1$ and $\bar p(0)=p(0)$ we get



(85) 



Now 



I[K(^,A)*P](0)-P(0)| 

K(^,,)(z)(i3l(z)zf + i?2(-ld-l))IN(d-l)f)d^ 



^{A,\){z)B2{z^d-~i))\\z{d~i)f dz 



(2(?;i -va))^ / 52(2(d-l))IN(d-l)|^I[„2,^,l]d-l(N(d-l)|)c^2(d-l) 



r-l 

+ E 

i=l 



(2{vi - Vi+i)) 



X / B2iz^d~l))\\z(d^l)\^[v,+ uv^]''-^i\Hd~l)\)d'Z(d~l) 



(86) 



-{2{vi_i-Vi)f-'' j B2{z, 



{d-l))\\Z{d-l) 



< (2^;,)'-^/ |52(^(d-i))llk(d-i)ll%,..]d-i(k(d-i)l)'i^(d-i) 

= (Al/^)l-'^ /|i?2(^(d-l))|||^(d-l)||%,Ai//3].-i(k(d-l)|)d^(d-l) 



< 2'^-id^/2^A < 2'^"! dL\, 

where |-Z((i„i)| = (|z2|5 • • • > Further, note that t; > n > 1 implies e'"^'^ < 
e" ju [in fact, ?;(1 — 1/u) > ti — 1 > Inu]. Using this remark and the fact that 
> 1 we find 



(87) 



exp { — ^ — exp(a(i — 1))^ = A^^''' exp f — ^ — exp(m)e 
V7-I / V7-I / 



z = l,...,r-l 
and therefore Ui/ur < e°^'. This and the equality $u_r=\lambda^{1/\beta}$ allow us to get

\a,\){z)Bi{z)z( dz 
<L f\K(^_^^^){z)\\zifdz 



1=2 

< 2L 5^ ^zf < 2LA ^ ( — ) < 2AL ^ e""' = 2AL(1 - e" 



1=1 



1=1 



-a\-l 



=0 



From (85), (86) and (88) we get 



|[K(AA)*p](0)-p(0)|<AL[2'*-^d + 2(l 



We now estimate the value | / \{(^ji^^x^{y)B{y)y1 ^\\y\\^ dy\. In view of (42), 

ul'^v^ < Aexp{/? - i//3e"} < Aexp{(l - i/)/3}, 
(90) , . . . 

Using (90), we get similarly to (88) 

^(A,x){y)B{y)yV\\yfdy 



u] <u] v^_i = Xexp{{l - u)f3exp{a{i - I))}, i = 2,...,r. 



<CAj\K^A,x){y)\\yir'T.\yj\^dy 



(91) 



: C4 



<2C4 



<2C4 



dy + \K^^^,^{y)\\y^r%\f' dy 

j=2-' 



.1=1 



i=l 



^(/3+7-i)//3^g-a«(/3+7-i) + Ad^exp{(l - i/)/3exp(aO} 



<2c4A[(l-e~°)"^+(i(l-e 

where the last inequality holds for $0<\lambda<1$ and we used that $\beta\exp(\alpha l)>\alpha l$, $\nu>1$. Summing up the results of (82), (89), (91) and taking into account (81) we obtain (29). □
2°. Proof of (30). In the same way as above we get, for $0<\lambda<1$,

K(AA)(y)iiiyirrf^ < rf'"/'/|K(^,A)(y)i E ly^r^y 



li=l 1=1 



mva \ — 



< C(d)A™/(^^) [(1 - e-™°)-i + (1 - e 

Here and in what follows we use the same notation $C(d)$ for possibly different positive constants depending only on $d$. □

3°. Proof of (31). Since i/ <2< we have, for < A < 1, 

vr-i ^ Ai/^^'^) exp{-z.exp(a(r - 1))} = \ym+-{i-m-i)/{iP? > aV/3. 

By the definition of $v_r$ this implies that $v_{r-1}-v_r\ge\lambda^{1/\beta}/2$. Further, as $u_r=\lambda^{1/\beta}$, in view of (87), we have

$u_r-u_{r-1}\ge(1-e^{-\alpha})\lambda^{1/\beta}.$

We deduce that

(92) $\mu_{r,r-1}\ge\mu_{r,r}\ge2^{1-d}\lambda^{d/\beta}(1-e^{-\alpha}).$
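
The second inequality in (92) can be checked directly. This sketch assumes, following the definitions used in A.3, that $\mu_{j,j}=(u_j-u_{j-1})(v_j-v_{j+1})^{d-1}$, $\mu_{j,j-1}=(u_j-u_{j-1})(v_{j-1}-v_j)^{d-1}$, $v_r=\tfrac12\lambda^{1/\beta}$ and $v_{r+1}=0$.

```latex
\begin{align*}
\mu_{r,r}
  &= (u_r-u_{r-1})\,(v_r-v_{r+1})^{d-1}\\
  &\ge (1-e^{-\alpha})\,\lambda^{1/\beta}\cdot
       \bigl(\tfrac12\lambda^{1/\beta}\bigr)^{d-1}
   = 2^{1-d}\,\lambda^{d/\beta}\,(1-e^{-\alpha}).
\end{align*}
```

The first inequality then follows from $v_{r-1}-v_r\ge\lambda^{1/\beta}/2=v_r-v_{r+1}$, since the two quantities share the factor $u_r-u_{r-1}$.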

Note that by (87),

$u_{i+1}-u_i\ge(1-e^{-\alpha})u_{i+1}$ for $i=1,\dots,r-1$.
Also, as > 1, it is straightforward to check that 

$v_i-v_{i+1}\ge(1-e^{-\alpha})v_i$ for $i=1,\dots,r-2$.

Thus, we get 

(93) A*i,i = ^i('^i - ^2)^^'^ > (1 - 6-")"^-^ exp(-(d - l)i^e")Ai/^+('^-i)/^. 

Recall that we are considering the case where p{l + p) <{(3 — 7)/7, 1 < 7 < 
P < 2, so that p(l + p) < 1, and thus p < ^"-^ . This and the choice of 
parameters a, v combined with (81) implies 

e -pv>\ — ^ — I -pv>\ — ^ — I ^ — zy = 5> 0.0891. 

Now, 

^ -e-"" -{d-\)v>^^>2b. 



7 — 1 7 ~ 1 
Hence, for $i=2,\dots,r-1$ we have
(94) 

r /3 ^ 

X exp< exp(a(z — 1)) — [d — l)i^exp(m) > 

l7 - 1 J 

> C((i)A^/T+('^-^)/(^^) exp{25exp(m)}. 

Note that 

r r 

(95) ||K(^,A)ll2 = /"M + EK^i ) < +2E^M • 

't=2 1=2 

We deduce from (92)-(95) that 

l|K(AA)ll2 < C((i)(A^/^+('^-^)/(^^) + X-''/^). 

This proves the second inequality in (31). The first inequality becomes obvious if we note that $V(\lambda)\le\ln\ln(1/\lambda)$ and so $\|\mathcal K_{(\mathcal A,\lambda)}\|_1=2r-1\le c_7\ln\ln(1/\lambda)$, for $\lambda$ small enough, where $c_7$ is an absolute constant. □

A.3. Proof of Lemma 3. Following the same lines as in the proof of (29) in Lemma 4 we obtain the bound (26) of Lemma 3 with

$c_5=C(d)(c_3+Lr+c_4r).$

1°. Proof of (27). By definition, Ur = A^/^ and for < A < 1 we have 
U2 > X^^'^, SO that vi = X^^^U2 ^"^ ^^^^ < ^1/(7/?). Using these remarks and 
acting as in the proof of (30) in Lemma 4 we obtain, for < A < 1, 



K^A,x)m\y\rdu<2d^/' 



Y^uT + dY^vT 

i=l 1=1 

< 2d'"/V(u;" + (it^D < C((i)r A'"/(T^) . □ 



2°. Proof of (28). Observe that $a_{j+1}-a_j>0$ for $j=1,\dots,r-1$, so that for $\lambda\to0$ we have $u_j/u_{j-1}\to\infty$ and $v_{j-1}/v_j\to\infty$. In particular,

$\mu_{j,j-1}=(u_j-u_{j-1})(v_{j-1}-v_j)^{d-1}\ge\mu_{j,j}=(u_j-u_{j-1})(v_j-v_{j+1})^{d-1}$

for all $\lambda$ small enough. Next note that, by definition,

a,_2>(a,_i-r')p"'>^^- 

IPP 
Then U2 < A^^ 7)/(7/3p) g^^id for A small enough we get by the definition of p:
Further, as $u_r=\lambda^{1/\beta}$ and $v_r=\tfrac12\lambda^{1/\beta}$, $v_{r+1}=0$,

$\mu_{r,r}\ge2^{-d}\lambda^{d/\beta}$

for $\lambda$ small enough. Next, for $1<j<r$,

By the definition of the sequence (afc), 

[d- I)/ (3 + Ok - p/ctk^i =d/(3, k = l,...,r-l. 

Thus 

Pj,j > ^x('i-^)/l^+^r-,-par-u+i) = 1 A"'//^, j = 2,...,r-l. 
Substitution of the above bounds into (95) yields 

l|K(AA)ll2<C(rf)A~'/^. □ 